ABSTRACT

Forecasting Thunderstorms over a 2- to 5-h Period
by Statistical Methods (August 1977)

Joseph Allen Zak, B.S., M.S., Pennsylvania State University
Chairman of Advisory Committee: Dr. James R. Scoggins

Classical statistical techniques, such as multiple regression with
variable selection and principal component analysis, were employed to
define combinations of parameters from meteorological observations which
optimally discriminate between the occurrence and nonoccurrence of
thunderstorms. Routine observations of weather elements at five levels
in the troposphere during two spring and summer seasons were analyzed
objectively onto a 65-km grid which spanned much of the central United
States. A thunderstorm occurrence was defined from manually digitized
radar (MDR) observations with an MDR code of four or greater as the basis.
The binary variable, one or zero for occurrence or nonoccurrence,
respectively, was the predictand. Parameters which are measures of
atmospheric moisture content, stability, and trigger mechanisms were
calculated from gridded fields of surface and upper-air observed elements
for different times each morning. These parameters were candidate
predictors in the variable-selection procedures. Data from all grid
points and for each day were pooled in order to provide an adequate
sample of thunderstorm observations. Errors which arise from usual
assumptions in a regression model were quantitatively analyzed.
Multicollinearity was severe but minimized through stepwise and maximum
$R^2$ variable selection techniques. Specification and heteroskedasticity
errors which result from the binary nature of the dependent variable
were present but did not invalidate the overall results.

The first four variables selected in every case were surface mixing
ratio, occurrence of precipitation during the morning, moisture
convergence, and a stability measure. These four variables include the
synoptic-scale conditions commonly recognized as prerequisites for
thunderstorms. The trigger mechanism was most difficult to specify
from the data, followed by stability, and then moisture. Additional
parameters (up to 17) continued to reduce the total unexplained variance
of thunderstorm occurrence. Time changes in surface parameters were not
selected as leading predictors. Upper-air observations added an important
ingredient, the stability, which, apparently, could not be inferred
adequately from surface measurements alone.

Data were grouped by surface wind component, random sampling, and
for a spring and summer month. About one-third of the data was saved
for a test of results. Thunderstorms were more predictable between 2000
and 2300 GMT when surface winds had a northerly component at 1800 GMT.
Random sampling was a way of reducing the influence of the many
observations of no thunderstorms which result from the low climatological
frequency of occurrence. Predictors in April reflected the importance of
kinematics, while those in July were associated with thermodynamic
variables as would be expected from synoptic-scale data.

Finally, regression statistics with the predictand being occurrence
of thunderstorms at 2000 to 2300 GMT did not show important differences
when upper-air parameters were calculated from observations at 1200,
1500, or 1800 GMT. However, these data were available only on one day,
24 April 1975.

The results from this study are comparable with other objective
forecasts and with those produced by weather station forecasters although
direct comparisons are difficult to make. This technique can be applied
rapidly and effectively in an operational environment at locations within
the developmental area. It offers all the advantages of an objective
forecast and contains no disadvantages from being tied to specific
forecast models.
1. INTRODUCTION

a. Statement of the problem

Thunderstorms are meteorological phenomena of great importance to
meteorologists because of the energy conversions and momentum transports
which occur. Manifestations of the above are the damaging winds and
hail so often observed. Unfortunately, the prediction of the occurrence
and intensity of these storms has been a problem of substantial
significance for meteorologists that has defied easy solution. There are
several reasons for this. First, a thunderstorm ranges in diameter from
a few tens to one hundred kilometers and lasts on the order of
$10^3$ to $10^4$ seconds. Such mesoscale phenomena elude detection by most
routine observations. Also, the analysis and forecast schemes that are
in operational use are applied to areas and time scales much greater than
these. The larger scales permit only a degree of success in predicting
large areas in which the likelihood of thunderstorm occurrence is great
(Fawcett, 1977). Another reason is that our knowledge of the dynamics
and thermodynamics of thunderstorms is not sufficient to explain these
phenomena. Also, the precise nature of the interactions between the
large- and small-scale circulations is not sufficiently well understood
(Barnes, 1976) for the purpose of exact forecasting.

One approach to the solution of the forecasting problem is through
parameterization of large-scale processes and use of appropriate
statistical techniques. There may be information from present observations
that, when used in certain combinations, can improve the prediction of
thunderstorms over a 2- to 5-h period. For periods less than two hours,
persistence and radar pattern recognition techniques should give the
best results. For periods beyond 5 h, it is unlikely that observations
will reflect the structure of the atmosphere which produces thunderstorms.
Furthermore, there may be improvement in prediction if upper-air data
were collected at more frequent intervals. Finally, optimum combinations
of parameters determined by statistical techniques may lead to
improved physical models.

The citations on the following pages follow the style of the
Journal of Applied Meteorology.

The hypotheses underlying this research are that manifestations of
the thermodynamic and hydrodynamic interactions which evolve into intense
convection in the atmosphere can be detected in routine observations
and that these observed parameters can be used in a statistical
model (which minimizes the unexplained variance of observed thunderstorms)
for prediction.

b. Previous studies

1) Nature of thunderstorms

Thunderstorms occur in comparatively small regions in the atmosphere.
Prior to 1947 there were few measurements of meteorological variables
in and near thunderstorms, so that circulation, pressure, temperature,
and moisture patterns were known only qualitatively. With the realization
of the Thunderstorm Project (Byers and Braham, 1949), however, our
quantitative knowledge increased significantly. Measurements collected
over a 2-y period established the horizontal and vertical structure of
many meteorological variables associated with thunderstorms and confirmed
the existence of multiple convective cells in various stages of
development.

Scorer and Ludlam (1953) proposed a bubble theory of convection
that explains many of the observed features of a growing convective
element. In this concept, the kinematics resemble those of a spherical
vortex, as discussed by Woodward (1959) and Turner (1964). Later stages
better resemble a jet of upward-moving air (Squires and Turner, 1962)
which exists in nearly steady state, particularly in the presence of
vertical wind shear. Ludlam (1963) discussed the role of the tilted
updraft core, a manifestation of wind shear, as a natural way to shield
the updraft that generates energy from the destructive influences of
precipitation-induced downdrafts and environmental entrainment.

Recent meteorological literature contains many articles concerning
thunderstorms, their interactions, intensification, movement, and
structure. It is not our purpose to review these in detail, but the
following synopsis will point out the complexity of thunderstorms and
environmental interactions with which we must be concerned.

Thunderstorms grow from a few kilometers in diameter to large,
quasi-steady supercells 20-50 km in diameter (Browning and Ludlam, 1962).
They can last from 30 min to many hours. Such storms may or may not
spawn tornadoes, rotate, contain destructive downdrafts or hail, or exist
in strong wind shear. Even the simple cumulus source is not simple at
all, as pointed out by Auer (1976) from his observations of distortions
of $\theta$ fields near a cloud boundary. The entraining plume model falls
short of describing the thunderstorm documented by Saunders and Paine
(1975). In this severe supercell there was little downdraft at the
surface, but a mesoscale updraft-downdraft doublet aloft seemed to permit
vertical motions to persist for several hours without large
perturbations in isentropic surfaces. Lemon (1976) discusses a flanking
line thunderstorm which includes both multicell and supercell storms
that derive impetus from entrainment of flanking cells. Still another
category termed "spearhead echo" by Fujita and Byers (1977) has intense
destructive downdrafts which appear to be tied to overshooting
tops of clouds at the anvil level. Finally, a fascinating observation
that "the growth of vigorous squall lines and severe weather are sharply
inhibited at and to the south of the subtropical jet" is documented
and explained by Whitney (1977).

As new mesoscale observational tools, such as Doppler radar and
storm satellites (Shenk et al., 1976), are added to our operational
inventory, we are likely to observe even more differences among
thunderstorms. Now, we have observations of internal motions within cells
from experimental Doppler radar (see, for example, Brandes, 1977; Kropfli
and Miller, 1976). Complicated motion patterns of outflow aloft and
jet stream interaction can be observed from stationary satellite picture
composites. The intricate details of overshooting, which seem to be
linked to tornado formation, can be seen from satellite film loops as
well.

There is no "typical" thunderstorm. Each storm is unique in many
respects. It is highly unlikely that identical environmental impulses
exist on different days or even in different locations on the same
day. It is not surprising that modelers and forecasters have much
difficulty in their tasks of understanding and forecasting these
phenomena.


Concerning the environment, we know that conditions necessary for
severe convective development involve a) convective instability and a
lifting mechanism to release it, b) abundant low-level moisture over
which a dry-air intrusion exists, and c) bands of strong winds in the
lower and upper levels (Miller, 1972). For less severe storms this
list reduces to moisture, potential instability¹, and a trigger. These
conditions must be identified through existing meteorological data
networks and numerical prognoses.

2) Forecasting procedures

Present forecasting procedures are somewhat subjective, and therefore
strongly influenced by a person's knowledge and experience. As
these vary with individuals who tend not to stay at one location,
thunderstorm forecasting procedures for a given point are highly
variable. A typical forecast involves 1) a study of the existing and past
large-scale weather patterns with emphasis on the location of
discontinuities and features discussed in the preceding paragraph, 2) an
analysis of stability of the atmosphere from the nearest and latest
upper-air sounding, 3) an evaluation of the latest available numerical
forecasts and interpolation for a given time and location, 4) a closer look
at the local weather and hourly changes, particularly from surface
observations and radar, and 5) a decision on whether or not all the
ingredients for thunderstorms will exist at the station for the future
time in question. This last step requires synthesizing all the data
from the previous steps.

¹Defined by Palmen and Newton (1969, p. 345) to include both
convective and conditional instability.

Objective techniques offer several advantages. They do not require
extensive personal experience; they can synthesize a great amount of
data rapidly and effectively; they can be automated. Furthermore, they
can be developed to make maximum use of historical observations. Finally,
established rules for parameterizations can be followed.

There are three steps in a parameterization approach. First, one
must know the processes (equations) involved. Next, relevant parameters
must be combined in an appropriate functional relationship. Finally,
one must test the results. A more detailed description of
parameterization techniques is given in the Global Atmospheric Research
Programme (GARP) Publication No. 8 (1972).

A statistical approach to thunderstorm forecasting is used partly
to alleviate the disparity between the lack of understanding and the
need for prediction, partly to glean as much information as possible
from existing data, and partly to gain the benefits of objective
forecasting schemes. The forecasting of mesoscale phenomena by statistical
techniques is not new. Persistence probability has aided the operational
forecaster in predicting changes in ceiling and visibility as well as
the onset and duration of critical values of meteorological variables.
Endlich and Mancuso (1968) combined a number of measured atmospheric
quantities into several kinematic and thermodynamic parameters which
were correlated with severe thunderstorms and tornadoes. Similarly,
observational data were used in an objective (statistical) procedure
to forecast severe thunderstorms and tornadoes by Miller and David (1971).
Probability-of-precipitation forecasts and other model output statistics
have been available for several years (Glahn and Lowry, 1972). More
recently, 24-h forecasts of probabilities of thunderstorms and severe
thunderstorms have become available from the National Weather Service
(Alaka et al., 1973). In these procedures, various potential predictors
from numerical forecasts were used in a screening regression program.
Those predictors selected account for a certain fraction of the total
variance of observed thunderstorms as derived from historical manually
digitized radar (MDR) data (Moore et al., 1974). Finally, a statistical
regression forecast for severe thunderstorms 2 to 6 h in the future also
recently became available (Charba, 1975). General thunderstorm forecasts
were added during the spring of 1976, and other improvements were
made in 1977 by Charba (1977) (see, also, the National Weather Service
Technical Procedures Bulletin 194).

In these latter procedures predictors were derived from surface
observations and dynamic model forecasts. An advantage to the use of
parameters from forecast models is that the physics of the circulation
system is included. A disadvantage, however, is that changes to the
model necessitate development of new equations, as the old regression
equations apply only to variables calculated from the former model.
Another disadvantage is that inaccuracies in the forecasts will limit
the degree to which the model can describe the predictand. Finally,
predictors lose their simple interpretation in that forecast elements
include biases from the model. In this research the disadvantages are
eliminated, and the physics will be included to the greatest extent
possible in the choice of candidate predictors.

c. Objectives

Within the general framework of developing a statistical model to
forecast thunderstorms over a 2- to 5-h period, the objectives are the
following: developing parameters for candidate predictors that are
consistent with known physical processes, parameterization methods, and
interactions between systems of different scale; relating various test
statistics to available verifications of existing thunderstorm-forecasting
methods; developing a way to use the spatial variation of meteorological
variables to best advantage when many independent variables are
involved; interpreting statistical results in terms of violations of
model assumptions; assessing the influence of upper-air observations
available at 3-h intervals; and finding optimum times for the dependent
variable and time changes for selected predictors.

This research will extend the work of Charba (1977) and others in
several important ways. First, different statistical models will be
evaluated, such as principal component analysis, variable selection, and
discriminant analysis. Analysis-of-variance statistics will be examined
along with plots of key parameters to determine the magnitudes of
errors due to assumptions made in the models. Secondly, the final model
will be tested on an independent data sample. These statistics will
be related to actual verifications of thunderstorm forecasts. Thirdly,
upper-air observations will be employed and their importance to observed
(by radar) thunderstorms assessed. A unique set of upper-air data
collected during atmospheric variability experiments (Fucik and Turner,
1975) will permit calculations of upper-air parameters every three hours
for one day. Such data are usually available only at 12-h intervals.
Finally, potential predictors will be calculated from the observed
variables in a way which will minimize intercorrelations which exist
naturally in this type of data.

As long as a short-period forecasting requirement exists,
meteorologists must strive to produce the best forecasts possible. This
research will contribute to that goal, and may also aid in the underlying
goal of understanding the complex interactions of atmospheric
parameters which culminate in thunderstorms.

d. Importance

Meteorological data networks and numerical forecasting techniques
are established for the synoptic scale of atmospheric analysis and
prediction. A true mesoscale data network is prohibitively costly and
could not be handled with present computer systems. Until new
observational tools such as Doppler radar and geosynchronous satellites are
perfected and automated, we are constrained in making point forecasts
of mesoscale phenomena such as thunderstorms with present-day data.

These data consist of 1) hourly surface reports from stations spaced
approximately 150 km apart, 2) hourly radar reports manually digitized
from a network in the eastern two-thirds of the United States,
3) satellite photographs at 30-min intervals available at selected locations,
and 4) 12-h upper-air observations from stations spaced approximately
300 km apart. Our task, then, must be to extract as much information
as possible from these data. This is made more realistic, physically,
by the postulate that the energy required to initiate the development
of mesoscale systems is contained within the synoptic-scale systems
(Global Atmospheric Research Programme, 1972, p. 1).
2. STATISTICAL APPROACH

The theory of classical statistical methods such as least squares
and regression analyses is well documented (see, for example, Draper
and Smith, 1966; Morrison, 1976; Neter and Wasserman, 1974), and will
only be presented here to the extent necessary to facilitate discussions
of model assumptions, variable selection techniques, and results.
Errors resulting from violations of model assumptions, and also from
use of a binary dependent variable and intercorrelated independent
variables, will be presented. We will conclude with discussions of the
interpretation of results for a regression model and principal component
analysis.


a. Linear models

Since the exact form of relationships between dependent and independent
variables is unknown, a common assumption (and good starting
point) is that of a linear relationship of the form

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_m x_{im} + \varepsilon_i. \qquad (1)$$


In this study, $Y_i$, the dependent variable, indicates a yes-no occurrence
of thunderstorms for a given time interval during a day and given
combination of grid points by assuming values of one and zero, respectively.
The independent variables, x's, are obtained from the measured or
analyzed observations. The error or residual term, $\varepsilon_i$, is due to the
fact that the occurrence of thunderstorms cannot be precisely predicted.
The $\beta$'s are the partial regression coefficients which relate observed
conditions to the occurrence of thunderstorms. These coefficients
are estimated from the data so as to minimize the sum of squared
differences between actual and estimated values of the dependent
variable. Estimates of the $\beta$'s are denoted by $\hat{\beta}$'s. This latter
procedure amounts to minimizing the following:

$$\sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_m x_{im} \right)^2. \qquad (2)$$
This term is called the sum of squares of the errors or SSE.
Differentiating (2) with respect to $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_m$ and setting each
equal to zero, we get a set of Normal Equations which can be written in
matrix notation

$$(X'X)\,\underline{\hat{\beta}} = X'\underline{Y}, \qquad (3)$$

where capital letters are matrices, underlined terms are vectors, and a
prime denotes the transpose of a matrix. Here $(X'X)$ is the sum of
squares and cross products of all independent variables and is called
the variance-covariance matrix since we are dealing with corrected (mean
subtracted from each observation) values. From (3) one can see that
$\underline{\hat{\beta}}$ can be obtained by multiplication of $X'\underline{Y}$ by the inverse,
$(X'X)^{-1}$. The partial regression coefficients, $\beta$'s, indicate the change
in Y associated with a unit change in the corresponding x while all other
x's remain constant. The fact that Y is a binary variable makes no
difference in these calculations.
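
The least-squares step described above can be sketched numerically. The following minimal example assumes NumPy and uses synthetic data: the three predictors, the sample size, and the rule that generates the binary predictand are all invented for illustration and are not taken from this study. It solves the Normal Equations (3) on corrected (mean-subtracted) values and cross-checks the result against a general least-squares solver.

```python
import numpy as np


def fit_normal_equations(x, y):
    """Solve (X'X) b = X'Y on mean-corrected data.

    Returns the intercept b0 and the vector of slopes b.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean(axis=0)            # corrected predictors
    yc = y - y.mean()                  # corrected predictand
    b = np.linalg.solve(xc.T @ xc, xc.T @ yc)
    b0 = y.mean() - x.mean(axis=0) @ b
    return b0, b


# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 3))
prob = 1.0 / (1.0 + np.exp(-(x[:, 0] - 0.5 * x[:, 1] - 1.0)))
y = (rng.uniform(size=n) < prob).astype(float)   # binary predictand, 1 or 0

b0, b = fit_normal_equations(x, y)

# Cross-check against a least-squares fit with an explicit intercept column.
design = np.column_stack([np.ones(n), x])
ref, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Nothing in the calculation depends on Y being binary, which is the point made in the text. Solving the linear system directly is used here in place of forming the inverse explicitly; the two are algebraically equivalent.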


b. Partitioning sums of squares

The statistical analysis continues by partitioning sums of squares
in the fashion of analysis of variance (ANOVA) to determine the
significance of the analyses as a whole as well as that of individual
coefficients. The total (corrected) sum of squares can be partitioned
as follows:

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2.$$

The term on the left is simply n times the variance of Y or total sum
of squares (SST). The first term on the right is the sum of squared
deviations of observed data from estimates based on the model. It is
the residual sum of squares of the errors (SSE). The last term
represents the sum of squared differences between the model estimates and
estimates when no model is assumed. This is usually called the sum of
squares due to regression (SSR). A mean square regression (MSR) and mean
square error (MSE) are obtained by dividing SSR and SSE by their
respective degrees of freedom. The partitioning is summarized in Table
1. The ratio MSR/MSE forms the basis for the statistical F-test for
a hypothesis that there is no linear relation or that $\underline{\beta} = 0$. Another
ratio used in regression analysis is the ratio of the sum of squares of
regression to the total sum of squares, SSR/SST. This quantity is
sometimes called the coefficient of determination and its symbol is
$R^2$. We can interpret $R^2$ as the fractional amount of total variance
accounted for by the linear combination of variables. The significance
of individual partial $\beta$'s can also be examined, but can be misleading
when x's are interrelated; that is, when it is impossible to vary one x
and hold all others constant. This problem will be examined in
paragraphs d and e.

Table 1. Analysis of variance.

Source of     Degrees of   Sum of                              Mean
variation     freedom      squares                             squares        F

Total         n-1          SST = $\sum (Y - \bar{Y})^2$          -              -
Regression    m            SSR = $\underline{\hat{\beta}}' X' \underline{Y}$   SSR/m          MSR/MSE
Residual      n-m-1        SSE = SST - SSR                     SSE/(n-m-1)    -
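
The partitioning of sums of squares can be illustrated with a short computation. This is a minimal sketch assuming NumPy; the function name `anova_table`, the two synthetic predictors, and the rule that generates the binary predictand are hypothetical and serve only to show how SST, SSE, SSR, the mean squares, F, and $R^2$ relate to one another.

```python
import numpy as np


def anova_table(x, y):
    """Partition the corrected sum of squares for a linear fit with intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.shape                          # n observations, m predictors
    design = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ coef                   # model estimates

    sst = np.sum((y - y.mean()) ** 2)       # total (corrected)
    sse = np.sum((y - y_hat) ** 2)          # residual
    ssr = np.sum((y_hat - y.mean()) ** 2)   # due to regression
    msr = ssr / m
    mse = sse / (n - m - 1)
    return {"SST": sst, "SSR": ssr, "SSE": sse,
            "MSR": msr, "MSE": mse, "F": msr / mse, "R2": ssr / sst}


# Synthetic binary predictand (hypothetical, for illustration only).
rng = np.random.default_rng(42)
x = rng.normal(size=(300, 2))
y = (x[:, 0] + rng.normal(size=300) > 0.5).astype(float)

table = anova_table(x, y)
```

Because the fit includes an intercept, SSR and SSE computed independently add up to SST, which is the identity the F-test and $R^2$ rely on.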

c. Model assumptions and violations

1) A linear model correctly describes the data.

The correct model is not known. Even if the model is of the form
in (1), which parameters should be included? Variable selection
techniques aid in this choice but do not guarantee that the best² subset
has been chosen.

Non-linear predictors can be included within the framework of linear
regression, because linear regression refers to linear parameters
($\beta$'s), not linear independent variables. It is unlikely that all
predictors are exactly linearly related to the occurrence of thunderstorms.
Fortunately, in a rather broad range for many predictors, the linear
approximation is representative of the association between dependent and
independent variables. We can linearize them, if we choose, by replacing
the original variable by a transformed version more nearly linearly
related to the predictand; however, we are not sure about its behavior
when it coexists in the model with other predictors. In several attempts
to linearize predictors, the overall improvement in $R^2$ was less than
3.0%. Also, once predictors are linearized, the equations are more
difficult to use in an operational environment. Finally, other errors
to be discussed next appear to be more serious. Therefore, linearization
was not pursued in this research.

²Best or optimum refers to the maximum possible reduction of variance
that can be achieved with the given linear combination of variables.
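
The statement that linear regression is linear in the parameters, not in the independent variables, can be made concrete with a small sketch. The example assumes NumPy and a deliberately exaggerated, synthetic curved relation between a single hypothetical predictor and a continuous response; it does not represent the predictors or the results of this study, where the gain from linearization was small.

```python
import numpy as np


def r_squared(predictor, y):
    """R^2 of a one-predictor linear fit with intercept."""
    design = np.column_stack([np.ones(len(y)), predictor])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Synthetic, strongly curved relation (hypothetical, for illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=400)
y = x ** 2 + rng.normal(scale=0.3, size=400)

r2_raw = r_squared(x, y)               # original predictor
r2_transformed = r_squared(x ** 2, y)  # transformed predictor, same linear model
```

Both fits use the same linear model and the same solver; only the predictor handed to the model changes. How much a transformation helps depends entirely on how far the true relation departs from linearity over the observed range.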

2) The x's are measured without error.

We know that there are errors in measuring all variables. Not only
are there errors in measuring basic variables such as temperature and
wind, but also there are errors due to finite difference approximations.
Unfortunately, the original data spacing and analysis procedure limit
the smallest space interval for which unique information is available.
Measurement error is not a problem in this study because it is small
compared to the total variance of the x's. For example, the temperature
error may be 0.5 K whereas the range of temperature may span 50 K.

3) The values of $\varepsilon$ are independent, random, normally-distributed
variables with a mean of zero and constant variance.

This term is estimated by residuals, or differences between observed
and predicted values from the computed linear function. Each item will
be discussed separately.

Independent $\varepsilon$: Meteorological variables are functions of time;
however, the time dependency in our case is somewhat masked because we
input data from a sequence of 36, 30, 30, 36, ... grid points for
successive days. In other words, day 1 contains 36 data points; days 2
and 3 contain 30 points; day 4 contains 36 points, etc. This scheme is
a consequence of the data input algorithm and remained the same for all
days in this study. Also, which time dependence (one day, two days,
etc.) is important? This dependence probably changes with different
synoptic situations, and the overall effect is masked by other problems
to be discussed.

Randomly distributed $\varepsilon$: A correctly specified model should show
residuals which are random when plotted against an independent variable.
While plots show definite non-randomness, it is most likely due to many
peculiarities resulting from a dichotomous, dependent variable. These
will be discussed in paragraph 5. Violations of the assumption that $\varepsilon$
is randomly distributed are called specification error and result from
not knowing the correct model form and not including the correct
variables. The residual sum of squares, SSE, is, therefore, inflated and
estimates of regression coefficients may be biased. There is no good
way of dealing with this problem except to recognize possible
nonlinearities in predictors and include physically relevant parameters. We are
naturally constrained in this latter work by our fixed observational
networks.

ε of constant variance, mean of zero: It is assumed that ε's are
from a single population with zero mean and variance σ². The mean is
zero but variance is a function of the x's due to the nature of the
predictand in the sample used in this study. This error is termed
heteroskedasticity.

Next,  we  will  consider  these  last  few  problems  in  more  detail. 

5) Special problems for a dependent, binary variable.

In addition to the error of specification, there are several
problems unique to the use of a binary dependent variable. The first and
most obvious is that the error term can assume just two values depending
on whether the predicted value is subtracted from zero or one. Fig. 1
is a plot of residuals for a typical predictor, W, which is positively
correlated to thunderstorm occurrence. Each letter represents the
number of observations corresponding to its position in the alphabet.
A is one observation; Z represents 26 observations. Errors are
clustered around small negative values (when Y is zero) and medium
positive values (when Y is one and the predicted values are weighted
toward zero due to the influence of all the zero observations).
Obviously, the assumption of normality is not valid. The second problem
is that the variance of ε is a function of x (Neter and Wasserman,
1974). Finally, since Ŷ is similar to a probability of occurrence³,
this number should lie between zero and one. The regression response
function does not automatically possess this property. Fig. 2 is a plot
of residuals versus predicted values for the dependent sample. Predicted
values range from -0.2 to 1.2, but the mean is about 0.15.

[Figure 1: scatter plot. Vertical axis: residual from regression model
(RY). Horizontal axis: surface mixing ratio (W) (g kg⁻¹). Legend:
A = 1 observation, B = 2 observations, ..., Z = 26 observations.]

Fig. 1. Plot of mixing ratio versus residual from regression model.

³We are trying to predict, in a continuous fashion, an occurrence
which is represented by a one or a non-occurrence represented by zero.

Concerning the first problem, even though error terms are not
normal, the least squares procedure still provides unbiased estimates.
Further, when sample sizes are large, the distribution of estimates is
asymptotically normal so that inferences concerning the regression
coefficients and mean responses can still be made. Variable selection
procedures, then, can still produce satisfactory results though little
mention of "significance" will be made in this work. The second problem
can be dealt with through weighted regression (Neter and Wasserman,
1974). Weights are assigned to observations in such a way that
responses or predicted values near zero or one receive maximum weight.
This type of regression was not performed because the observations of
thunderstorms are already weighted by virtue of low climatological
frequencies of thunderstorm occurrence. An attempt was made to deal
with the problem through inclusion of random samples of no-thunderstorm
observations and through prior screening of no-thunderstorm cases by
critical values of selected predictors. The problem of predicting
less than zero or greater than one is not particularly serious since
the threshold for forecasting thunderstorms from predicted values is
arbitrary. Nevertheless, it appears that fitting a logistic function
such as

$$Y = \frac{\exp(-10.0 + 0.1x)}{1 + \exp(-10.0 + 0.1x)} \qquad (5)$$

would eliminate this problem. Such a function is shown in Fig. 3 for
one independent variable.

It can be linearized by the simple transformation,

$$Y^{*} = \ln\left(\frac{Y}{1 - Y}\right). \qquad (6)$$

Special precautions are required for zero predicted values. Note that
here, too, added weight is given to both near-zero and one predictors.
Glahn and Bocchieri (1975) used a similar function in an objective
forecasting scheme and found difficulties in some cases due to the
symmetric nature of the curve and poor fit near the threshold
probability for yes-no forecasts. Also, fitting this function is not easy
unless there are repeat observations for each level of x. Such is not
the case with the data used in this research.
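
The logistic response (5) and its linearizing transformation (6) can be sketched in a few lines of Python. This is only an illustration with the constants of (5) as defaults; the clipping guard `eps` is an assumption introduced here to stand in for the "special precautions" at predicted values of exactly zero (or one), and is not taken from this study.

```python
import math

def logistic(x, a=-10.0, b=0.1):
    """Eq. (5): Y = exp(a + b*x) / (1 + exp(a + b*x)); always between 0 and 1."""
    z = a + b * x
    return math.exp(z) / (1.0 + math.exp(z))

def logit(y, eps=1e-6):
    """Eq. (6): Y* = ln(Y / (1 - Y)).

    eps is a hypothetical guard: y is clipped into [eps, 1 - eps] so that
    the logarithm stays finite when y is exactly zero or one.
    """
    y = min(max(y, eps), 1.0 - eps)
    return math.log(y / (1.0 - y))
```

With these defaults the curve passes through Y = 0.5 at x = 100, and applying `logit` to `logistic(x)` recovers the linear term -10.0 + 0.1x.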

d.  Multicollinearity 

Another more serious problem results from use of interrelated
predictors (x's). The x's are in fact related in at least three ways.
First, the basic, measured variables are related through physical laws
and relationships such as the gas law, first law of thermodynamics, or
thermal wind equation; therefore, parameters derived from the basic
variables are related. This problem is usually exaggerated by using
the basic five surface variables and five upper-air variables (the
latter for each of four chosen pressure levels in the troposphere), and
computing up to 35 parameters for more than twice as many points in
space as there are original data measurements⁴. Also, several measures
of the same basic dimension, say stability, are calculated because
the best measure of stability is not known. Therefore, many more
variables than we need are included. Secondly, variables are related
in space for many hundreds of kilometers. The very concept of an air
mass suggests a dependence for many variables. Finally, there is a
time dependence in that meteorological variables on one day are
correlated to those on the next day (or longer).

⁴This is a consequence of the analysis schemes, and the price paid
for trying to preserve as much detail as possible.

The problem of intercorrelated "independent" variables is called
multicollinearity and for data in this research is severe enough to
prevent us from calculating (X'X)⁻¹ since near-singularities exist⁵.
Therefore, we must use a variable-selection technique to be discussed
in Section 2e or a principal component analysis discussed in Section 2g.
When an inverse can be computed and the model is correct, then the
regression coefficients estimated by the least squares technique are
unbiased. This means that the expected values computed from repeated
samples will approach the correct value in the mean. The
space-correlation problem is reduced by use of every fourth grid point
in the statistical analyses.

⁵A generalized inverse can be calculated; however, the estimates
of the coefficients would be biased.
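
One common way to see how close X'X is to singular is the ratio of its largest to smallest eigenvalue. The sketch below is an illustration only, not the diagnostic used in this study; the function name and the choice to standardize each predictor first are assumptions made here.

```python
import numpy as np

def xtx_condition_number(X):
    """Ratio of largest to smallest eigenvalue of X'X for standardized predictors.

    X is an (observations x predictors) array. Very large ratios indicate
    that X'X is nearly singular, so (X'X)^-1 cannot be computed reliably.
    Assumes no predictor is constant (its standard deviation would be zero).
    """
    Xc = X - X.mean(axis=0)
    Xs = Xc / Xc.std(axis=0)
    eig = np.linalg.eigvalsh(Xs.T @ Xs)          # eigenvalues in ascending order
    smallest = max(eig[0], np.finfo(float).tiny)  # guard against round-off <= 0
    return eig[-1] / smallest
```

Two unrelated predictors give a ratio near one, while adding a third predictor that is almost exactly the sum of the first two drives the ratio up by many orders of magnitude.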

e. Variable selection methods and inference

Fortunately, the regression analysis is robust in that even moderate
deviations from the assumptions do not invalidate results. The
significant problem of multicollinearity, however, can have severe
influences (even critical in our case with all variables where the X'X
matrix is near singular). Of many interrelated variables, which should
be kept in the model? To deal with this problem, four different variable
selection techniques were used in this study; all try to choose subsets
of predictors which minimize the residual mean square (MSE). They are
forward selection, backward elimination, stepwise, and maximum R²
improvement.

1) Forward selection

This procedure, often called step-up, begins by choosing that
variable which is most highly correlated with the dependent variable.
The second variable is chosen by seeking the next most highly correlated
of the remaining independent variables with the dependent variable,
according to the partial correlation coefficient. In other words, for
each remaining independent variable, a partial F-statistic is calculated
that reflects that variable's contribution to the model were it to be
included. If this statistic for one or more variables has a
"significance level" greater than a specified amount (0.50 is used in
this study), then the variable with the largest F is included. This
process is repeated and variables added one at a time until none passes
the F-test or no more remain. Once a variable is added to the model, it
must remain whether or not its influence is negated by other variables
added. This procedure is likely to give near optimum few-variable
models, but deteriorates as more are added.
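
The forward-selection loop described above can be sketched as follows. This is a minimal illustration, not the SAS procedure used in this study: the entry test here is a plain threshold on the partial F-statistic (`f_to_enter`, a hypothetical parameter) rather than the "significance level" of 0.50, and the helper names are assumptions.

```python
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares of a least-squares fit of y on an intercept
    plus the predictor columns listed in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_selection(X, y, f_to_enter=4.0):
    """Add one predictor at a time, always the one with the largest partial F.

    Stops when the best remaining partial F falls below f_to_enter or when
    no predictors remain. A variable, once added, is never removed.
    """
    n, m = X.shape
    chosen = []
    current = sse(X, y, chosen)
    while len(chosen) < m:
        best = None
        for j in range(m):
            if j in chosen:
                continue
            new = sse(X, y, chosen + [j])
            dof = n - (len(chosen) + 1) - 1      # residual degrees of freedom
            f = (current - new) / (new / dof)    # partial F for adding column j
            if best is None or f > best[0]:
                best = (f, j, new)
        if best[0] < f_to_enter:
            break
        chosen.append(best[1])
        current = best[2]
    return chosen
```

The returned list gives column indices in the order in which they entered the model, which corresponds to the order of selection reported in the summaries.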

2) Backward elimination

In this technique, also called step-down, the model with all
variables is considered; then variables are deleted one at a time
starting with the one whose β exhibits the lowest F-statistic. Here, we
are likely to get optimum many-variable models but poor results when
more and more variables are deleted since they can never be included
again.

3) Stepwise

This procedure is a refinement of forward selection. At each step
before determining the next variable to be added, the F-statistics are
checked for the coefficients already chosen to see if any should be
deleted based on another prespecified "significance level" (in our case
0.1). Only after this check for deletion is made can another variable
be added. The procedure terminates when no partial F is > 0.5 or a
variable to be added is one just deleted. This procedure is most
appealing so far; but, still an optimum subset is not guaranteed (Draper
and Smith, 1966). Stepwise is the predominant procedure used in this
research.

4) Maximum R² improvement

A one-variable model is chosen as with forward selection. Then
every combination of variables with this one is examined. When two
variables are included each of these is compared to each variable not
in the model. For every comparison it is determined if removing the
variable in the model and replacing it with the excluded variable
would increase R². After all comparisons, the switch is made that
gives the highest R². This process continues with each variable added.
Optimum one-to-eight variable models are most likely to be found, but
the costs in computer processing are high when more than 20 candidate
predictors are used (Barr et al., 1976).

Although variable selection procedures do not guarantee that an
optimum subset of predictors is chosen, the stepwise procedure does a
credible job up to about the fourth variable for data in this study.
Comparisons were made of variables selected by the stepwise procedure
with those from the best four- to seven-variable models where all
possible regressions were considered. In all cases the four-variable
models were identical. The five- and six-variable models differed by
just one variable. The best seven-variable model differed by two
variables. Due to computer-processing limitations, comparisons were not
exact in that only 18 of 25 predictors were considered for all possible
regressions. Even for this combination there were 31,824 possibilities.
In the case of the seven-variable model, the two variables not selected
by the "best" procedure were not available to it. Beyond five
predictors there could be any number of variable combinations which
produce the same or even slightly higher R². Therefore, discussions of
variable combinations will usually be limited to the first four or five.
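
The figure of 31,824 quoted above equals the number of distinct seven-variable subsets that can be drawn from 18 candidate predictors, which suggests it refers to the seven-variable case. A short check follows; the comparison with the full set of 25 candidates is added here only to illustrate how quickly the count grows.

```python
from math import comb

# Number of distinct 7-variable models from 18 candidate predictors.
subsets_of_18 = comb(18, 7)
# Same model size if all 25 candidates had been allowed (for comparison only).
subsets_of_25 = comb(25, 7)

print(subsets_of_18)  # 31824
print(subsets_of_25)  # 480700
```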

f. Interpretation of regression-model results

In those discussions, intense convection, thunderstorm occurrence,
and MDR ≥ 4 are used synonymously, though the latter is the true
predictand. The coefficient of determination, R², the amount of
variance accounted for by the linear combination of variables, and
reduction of total variance due to the regression model also are used
synonymously. Finally, independent variables, predictors, and x's mean
the same as do dependent variable, Y, and predictand.

Models of form (1) are used where particular x's, parameters, are
chosen by variable selection techniques discussed in Section 2e. The
associated coefficients, β's, are computed according to the least
squares method (Section 2a). Analysis of variance tables such as shown
in Table 1 (p. 12) are produced for every different combination of
independent variables and for all data subdivisions. A few of these
tables for important combinations of parameters are shown in Appendix A.
In general, however, only summaries are included in the text. These
summaries present the total R², number of variables (x's) which
produced the R², mean square error for this number of variables, and
occurrence frequency for the dependent variable (frequency of
thunderstorm occurrence). Also shown are the variables selected in the
order in which they were chosen, the cumulative R², and the sign of the
partial regression coefficient (β) for each data stratification. In
order to reconstruct the linear equation for a given combination of
variables, the partial regression coefficients from Appendix A are
required. These coefficients are then substituted into (1) together
with their respective predictors.

Although R² will be discussed to some extent, the R² differences
from sample to sample must not be interpreted to imply improved
regression results unless the proportion of ones (as opposed to zeros) is
also the same. For a binomial distribution the variance is given by
np(1-p). Since this term appears in the denominator of R², an increased
p (up to 0.5) results in a lower R² given the same regression sum of
squares. Three examples follow, each using similar but artificial data
with one independent, continuous variable positively correlated to one
dependent, dichotomous variable.

1) Example one: occurrence frequency 10%

Assume there are ten observations of dependent variable y and
independent variable x and that the frequency of ones is 10%. The data
and regression analysis are shown in Table 2.

Table 2. Data and regression analysis for 10 observations of
hypothetical variables x and y with 10% occurrence frequency.

y   (y-ȳ)   (y-ȳ)²   x   (x-x̄)   (x-x̄)²   (x-x̄)(y-ȳ)   [(x-x̄)(y-ȳ)]²   Summary statistics

2) Example two: occurrence frequency 30%

Table 3 illustrates another example with the same sample size but
different occurrence frequency. Note that as the occurrence frequency
increases, R² decreases. This decrease is a consequence of the
increased variance of y (higher p).

Table 3. Data and regression analysis for 10 observations of
hypothetical variables x and y with 30% occurrence frequency.

y   (y-ȳ)   (y-ȳ)²   x   (x-x̄)   (x-x̄)²   (x-x̄)(y-ȳ)   [(x-x̄)(y-ȳ)]²   Summary statistics


3) Example three: random sampling

We will now consider the effect of random sampling on R² in
Table 4. Table 3 is duplicated for all occurrences but for only 57%
of the nonoccurrences.

Table 4. Data and regression analysis for 57% of nonoccurrence
observations in Table 3 data.

 y    (y-ȳ)   (y-ȳ)²    x    (x-x̄)   (x-x̄)²   (x-x̄)(y-ȳ)   [(x-x̄)(y-ȳ)]²

 0   -0.429   0.184     2   -1.143   1.306      0.490         0.240
 1    0.571   0.326     4    0.857   0.735      0.489         0.239
 1    0.571   0.326     4    0.857   0.735      0.489         0.239
 0   -0.429   0.184     3   -0.143   0.020      0.061         0.004
 0   -0.429   0.184     2   -1.143   1.306      0.490         0.240
 1    0.571   0.326     5    1.857   3.449      1.060         1.124
 0   -0.429   0.184     2   -1.143   1.306      0.490         0.240

 3            1.714    22            8.857                    2.328   Sum

Summary statistics: SST = Σ(y-ȳ)² = 1.714; R² = SSR/SST = 0.154;
x̄ = 3.143; ȳ = 0.429 = p.


Table 5 shows the comparison of pertinent statistics for the
different occurrence frequencies and the random sample. In the case of
the random sample, sums of squares of x's decrease relative to the
other cases because x̄ increases. Total sum of squares, Σ(y-ȳ)²,
decreases compared to the 30% sample in this case because there are
fewer elements to sum. Finally, the squared sum of cross products
decreases but at a slower rate than in Table 3. Consequently, R²
increases for the random sample compared to the 30% case. It is clear
from these examples that R² cannot be used as a measure of relative
strength of the regression model when the frequency of occurrence
changes. The only true measure of "goodness" will be the performance
of the function in an operational environment.

Table 5. Comparison of Tables 2, 3, and 4.

Term                 Table 2       Table 3       Table 4
                     10% ones      30% ones      Random (43% ones)

x̄                    3.0           3.0           3.140
Σ(x-x̄)²              10.0          12.0          8.857
ȳ                    0.1           0.3           0.429
Σ(y-ȳ)²              0.9           2.1           1.714
Σ[(x-x̄)(y-ȳ)]²       3.30          3.48          2.328
SSR                  0.330         0.290         0.263
R²                   0.367         0.138         0.154
MSE                  0.071         0.226         0.289
n                    10            10            7
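
The derived rows of Table 5 appear to follow from the tabulated sums by simple arithmetic: SSR is the tabulated cross-product term divided by Σ(x-x̄)², R² is SSR divided by the total sum of squares, and MSE is the residual sum of squares divided by n-2. The sketch below takes the tabulated sums as given and reproduces the remaining rows to within rounding; the function and variable names are illustrative only.

```python
def derived_rows(cross_term, sum_sq_x, sst, n):
    """Return (SSR, R^2, MSE) from the tabulated sums of one column of Table 5.

    cross_term : tabulated cross-product term
    sum_sq_x   : sum of squared deviations of x
    sst        : total sum of squares of y
    n          : number of observations (one predictor, so n - 2 residual d.f.)
    """
    ssr = cross_term / sum_sq_x
    r2 = ssr / sst
    mse = (sst - ssr) / (n - 2)
    return ssr, r2, mse

# Tabulated sums copied from Table 5.
table5 = {
    "Table 2 (10% ones)": (3.30, 10.0, 0.9, 10),
    "Table 3 (30% ones)": (3.48, 12.0, 2.1, 10),
    "Table 4 (random, 43% ones)": (2.328, 8.857, 1.714, 7),
}

for label, sums in table5.items():
    ssr, r2, mse = derived_rows(*sums)
    print(f"{label}: SSR={ssr:.3f}  R2={r2:.3f}  MSE={mse:.3f}")
```

The ordering of R² across the three columns (highest at 10% ones, lowest at 30% ones, intermediate for the random sample) is the point of the comparison.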


g. Principal component analysis

Another way to approach the multicollinearity problem is through
a technique called principal component analysis first introduced to
meteorology over two decades ago by Lorenz (1956). Brier and Meltesen
(1976) give a brief history of meteorological applications. Only a
summary of the methodology will be presented here.

Assume that new variables, principal components (Cᵢ), can be
generated that are linear combinations of observations of original
variables as follows:

$$
\begin{aligned}
C_1 &= b_{1,1}x_1 + b_{1,2}x_2 + \cdots + b_{1,m}x_m \\
C_2 &= b_{2,1}x_1 + b_{2,2}x_2 + \cdots + b_{2,m}x_m \\
    &\;\;\vdots \\
C_m &= b_{m,1}x_1 + b_{m,2}x_2 + \cdots + b_{m,m}x_m
\end{aligned}
\qquad (7)
$$

Also choose coefficients for C₁, i.e. b₁ⱼ, so that the variance of
C₁ is as large as possible. Choose the coefficients b₂ⱼ so that the
variance of C₂ is as large as possible subject to the constraint that
observations of C₂ be uncorrelated with those of C₁. We continue for all
Cᵢ and impose an additional restriction that squares of coefficients in
any Cᵢ sum to one. It turns out (Harris, 1975) that if the eigenvalues
and eigenvectors of the X'X matrix are found (since it is real,
symmetric, and positive definite), then the assumptions are fulfilled.
Also, the components of the eigenvectors normalized to length one are
the bᵢⱼ's.

Since the variance is just a measure of the variability for
different observations, it is reasonable to interpret C₁ as that linear
combination of original variables which maximally discriminates among
our observations. These components also partition the total variance
of the original variables into m additive parts, hence, the
interpretation that they "account for" a certain fraction of the total
variance. Rows (or columns) in the symmetric (X'X) matrix which are
linear combinations of each other will produce a zero eigenvalue and will
contribute nothing to the total variance; hence, we have another way of
assessing multicollinearity and of finding, possibly, how many true
dimensions or hypothetical latent variables there are in the particular
(X'X) matrix which is evaluated in this manner. It is this property
which has led to recent applications in meteorology (Smith and Woolf,
1976; Brier and Meltesen, 1976). A method for calculating eigenvalues is
given by Essenwanger (1976). The procedures used in this study are
those available in the statistical analysis system (SAS) (Barr et al.,
1976).
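
The construction in (7) can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions, not the SAS procedure used in this study: the predictors are centered on their means before X'X is formed (the text does not say whether X'X is centered or standardized), and the function name is hypothetical.

```python
import numpy as np

def principal_components(X):
    """Eigen-decomposition of X'X for mean-centered predictors.

    Returns (eigvals, eigvecs, scores), ordered from largest to smallest
    eigenvalue. Column i of eigvecs holds the unit-length coefficients
    b_{i,1..m} of component C_i; scores holds the observations of each C_i.
    """
    Xc = X - X.mean(axis=0)                  # assumption: center each predictor
    xtx = Xc.T @ Xc                          # real, symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(xtx)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # observations of C_1 ... C_m
    return eigvals, eigvecs, scores
```

The eigenvalues sum to the trace of X'X, so each one divided by that sum is the fraction of total variance "accounted for" by its component; an eigenvalue near zero flags a predictor that is a linear combination of the others.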
3. DATA SELECTION AND PROCESSING

a. Location

The area for this study was chosen to provide relative homogeneity
in terrain, an adequate sample of meteorological observations, and as
many thunderstorm occurrences as possible during the time digital
radar data were available. The period chosen included April through
July 1974 and 1975, 30 days in each month. Surface, upper-air and
meteorological radar data were used in the analysis. Each will be
discussed separately.

b. Surface data

Altimeter setting, wind speed, wind direction, temperature, and
dew point temperature were obtained for 97 locations as shown in Fig. 4
for five times each day: 1200, 1500, 1600, 1700, and 1800 GMT.

c. Upper-air data

Observations of geopotential height, temperature, dew point
depression, wind speed, and wind direction at 1200 GMT were used for each
of four standard pressure levels: 850, 700, 500, and 300 mb⁶. There
were 14 upper-air locations (Fig. 4). Both surface and upper-air data
were obtained from the USAF Environmental Technical Applications Center
at Scott AFB, IL.

⁶Only geopotential height and winds were utilized for the 300-mb
level.

d. Radar data

Radar data consisted of manually digitized radar (MDR) observations
for each hour from 1630 to 0130 GMT and for 187 boxes shown within the
bold line in Fig. 5. Note that the centers of these boxes fall within
the general area outlined in Fig. 4 (p. 32). These data were provided by
NOAA's Techniques Development Laboratory. Radar observations are usually
taken about 30 to 35 min past each hour and transmitted in coded form
(Table 6) (Foster and Reap, 1973). Digital codes represent the maximum
intensity of reflectivities anywhere in a square area approximately 85
km on a side. These codes also take into account the general area
coverage of the echoes. Time composites for the maximum code in any
of the following groups were saved for each day: 1635-1735, 1835-1935,
1935-2235, and 2235-0135 GMT. These will be called 1700-1800 GMT,
1900-2000 GMT, 2000-2300 GMT and 2300-0200 GMT periods, respectively.

Table 6. Explanation of Manually Digitized Radar (MDR) code.

Code   Maximum       Coverage                      Maximum          Intensity
No.    Observed      In Box                        Rainfall         Category
       VIP* Values                                 Rate (in./hr)

0      No Echoes
1      1             Any VIP1                      <.1              Weak
2      2             ≤ 50% of VIP2                 .1-.5            Moderate
3      2             > 50% of VIP2                 .5-1.0           Moderate
4      3             ≤ 50% of VIP3                 1.0-2.0          Strong
5      3             > 50% of VIP3                 1.0-2.0          Strong
6      4             ≤ 50% of VIP3 and 4           1.0-2.0          Very Strong
7      4             > 50% of VIP3 and 4           1.0-2.0          Very Strong
8      5 or 6        ≤ 50% of VIP3, 4, 5, and 6    >2.0             Intense or Extreme
9      5 or 6        > 50% of VIP3, 4, 5, and 6    >2.0             Intense or Extreme

*Video Integrator Processor
Radar data had to be grouped by time intervals to obtain an adequate
sample because many hours of observations were missing. There are
several reasons for the specific groupings. The first interval is to
be used as a candidate predictor. The latter three were all predictands
and were formulated from operational considerations. A 3-h interval
represents a forecast of thunderstorms valid within 1.5 h of an
estimated time of arrival for aircraft flight operations. For example, an
aircrew may obtain a weather briefing at 1830 GMT for a 5.5-h flight
with departure time estimated to be 1900 GMT and estimated arrival time
at destination of 0030 GMT. A forecast for intermittent thunderstorms
would cover the period from 2300 to 0200 GMT. This interval is
reasonable owing to operational uncertainties such as delays in departure
and landing for long flights and to uncertainties in predicting the
thunderstorm event so long in advance. The 1-h interval at the earlier
time reflects both reduced forecast and operational uncertainties because
of the short forecast lead time and brief flying time. For example, a
crew for a 1-h flight may get a weather briefing at 1800 GMT for estimated
arrival at 1930 GMT. The forecast would then cover the period from
1900 to 2000 GMT. Finally, an attempt was made to avoid overlapping
intervals so that forecasts for the different times could be compared.

e. Initial processing

Raw data were available on magnetic tapes. Programs were written
to (1) select specific observed elements, times, and stations; (2) ensure
all missing hours and days were accounted for; (3) check for gross
errors in reported values; and (4) write all data onto a direct access
storage device. Observations that were either missing or which
contained numbers outside the range of what would be considered
reportable values for that variable were filled with zeros and ignored in
subsequent processing. Many observations were checked against archived
teletype data to ensure accuracy.

f. Objective analysis

The results of this research were dependent upon the
representativeness of raw data interpolated or analyzed onto an
equally-spaced grid system. Therefore, considerable care was taken in
choosing an analysis procedure and grid. An 18 x 18 array of grid points
spaced 65 km apart was chosen to preserve as much detail in the surface
and radar data fields as possible. Boundary points were used only for the
calculation of derivatives so that only 256 points (16 x 16) were used
for statistical correlations. An analysis scheme by Barnes (1964) was
selected, not only because results obtained were very similar to hand
analysis, but also because scales of atmospheric features retained by
this technique could be determined, and the program was efficient. Scan
radii and initialization procedures were adjusted to produce an optimum
balance among the following: (1) cost, since we had 12,480 total
analyses to perform⁷; (2) missing data; (3) amplification of spurious
waves; (4) small-scale surface features; (5) radar grid transposition;
and (6) duplication of manual analyses. The optimum choice for scan
radius, number of iterations, and characteristic wave lengths preserved
as a result of these choices are summarized in Table 7. Wind was
converted to components with respect to grid orientation (nearly
latitude-longitude aligned). These and all other basic variables were
analyzed onto the 18 x 18 grid array for each time and day. From these
data the predictand and candidate predictors were computed at each grid
point as discussed in the next Section.

⁷240 days x (5 surface variables x 5 times + 5 upper-air variables
x 3 levels + 3 upper-air variables x 1 level + 1 radar variable x 9
times).
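
A successive-correction analysis of the general kind attributed to Barnes (1964) can be sketched as below. This is only an illustration under stated assumptions: the Gaussian weight exp(-d²/R²) with R the scan radius, the evaluation of the analysis directly at the observation sites, and all function names are choices made here, not details taken from this study. The first guess is the mean of the observations and three passes are the default, following the surface and upper-air entries of Table 7.

```python
import numpy as np

def weighted_mean(src_xy, src_val, dst_xy, scan_radius_km):
    """Distance-weighted mean of src_val at each destination point.

    Weights are exp(-d^2 / R^2), an assumed Gaussian form with R the
    scan radius; coordinates are in km.
    """
    d2 = ((dst_xy[:, None, :] - src_xy[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / scan_radius_km ** 2)
    return (w @ src_val) / w.sum(axis=1)

def barnes_analysis(obs_xy, obs_val, grid_xy, scan_radius_km, iterations=3):
    """Start from the mean of the observations, then repeatedly add the
    weighted mean of the remaining observation-minus-analysis residuals."""
    first_guess = obs_val.mean()
    grid = np.full(len(grid_xy), first_guess)
    at_obs = np.full(len(obs_xy), first_guess)
    for _ in range(iterations):
        resid = obs_val - at_obs
        grid = grid + weighted_mean(obs_xy, resid, grid_xy, scan_radius_km)
        at_obs = at_obs + weighted_mean(obs_xy, resid, obs_xy, scan_radius_km)
    return grid
```

Each additional pass draws the analysis closer to the observations, which is why the number of iterations and the scan radius together control the wavelengths that are preserved.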

Table 7. Summary of analysis parameters.

Data        Average data   Scan     Iterations   Initialization            Wavelength of    Wavelength of
source      spacing        radius                                          90% amplitude    50% amplitude
                                                                           preservation     preservation

Surface     130 km         275 km   3            Mean value of parameter   450 km           300 km
Upper air   370 km         520 km   3            Mean value of parameter   900 km           600 km
Radar       83 km          64 km    1            0                         *                *

*With one iteration this was essentially an interpolation of the nearest MDR observation to
each grid point.
4.  PARAMETERIZATION AND DATA SUBDIVISION

This section includes the formulation of predictands and the
development of predictors in the context of parameterization of
synoptic observations. Also discussed is the subdivision of the total
data set into subsets for statistical processing.

a.  Predictand  formulation 

Coded MDR data from the 65-km grid and three time groups, 1900-2000
GMT, 2000-2300 GMT, and 2300-0200 GMT, were converted to a single binary
form. Any MDR code equal to or greater than four at any of the four
nearest neighbor grid points as shown in Fig. 6 was assumed to represent
the occurrence of a thunderstorm (Mogil, 1974), and was assigned
the binary code one; otherwise code zero was assigned. The data void
areas in this figure result from the use of every fourth grid point for
the statistical analyses. A zero could only be assigned if the grid
point in question and the nearest neighbors were all reporting MDR
codes less than four. The best resolution in the predictand area is
limited to a square area about 138 km on a side. This was the smallest
area for which unique information from the original radar grid (83 km
square) was available. We have not distinguished among precipitation
intensities (or thunderstorm severities) in this study.
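
The coding rule above can be sketched as follows. This is only an illustration, not the program used in the study: the function name, the use of NaN for a non-reporting point, and the assumption that the "four nearest neighbors" are the points immediately north, south, east, and west are all hypothetical, since Fig. 6 is not reproduced here.

```python
import numpy as np

def binary_predictand(mdr, threshold=4):
    """Return 1 where the point or any of its four neighbors has an MDR
    code >= threshold, 0 where all five points report codes below the
    threshold, and NaN otherwise (some point missing, no occurrence seen).
    Missing reports are assumed to be coded as NaN."""
    mdr = np.asarray(mdr, dtype=float)
    # Pad with NaN so that points off the grid count as "not reporting".
    padded = np.pad(mdr, 1, constant_values=np.nan)
    stack = np.stack([
        padded[1:-1, 1:-1],   # the grid point itself
        padded[:-2, 1:-1],    # neighbor one row up
        padded[2:, 1:-1],     # neighbor one row down
        padded[1:-1, :-2],    # neighbor one column left
        padded[1:-1, 2:],     # neighbor one column right
    ])
    with np.errstate(invalid="ignore"):
        hit = np.any(stack >= threshold, axis=0)
    all_reporting = ~np.any(np.isnan(stack), axis=0)
    out = np.full(mdr.shape, np.nan)
    out[all_reporting] = 0.0
    out[hit] = 1.0
    return out
```

Under these assumptions a single code of four marks the point itself and its four neighbors as occurrences, while a point on the grid edge with no occurrence nearby is left undefined rather than coded zero.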

b.  Predictor  formulation 

One approach now tempting many investigators because of expanded
computer capabilities is to use every imaginable parameter as a
candidate predictor. For just the basic analyzed variables (temperature,
wind components, pressure, etc.) along with their first and second time
and space derivatives, there would be well over 100 candidates, many
of which would be interrelated. Selection techniques for such a large
number, not even counting products or time changes of space derivatives
and vice versa, would be expensive and, more important, results would
be extremely difficult to interpret. In this study all predictors have
been chosen through parameterization techniques for categories of
variables known to be associated with thunderstorms.

It is generally recognized that there are three synoptic-scale^8
conditions for thunderstorms: moisture, potential instability, and a
trigger mechanism. Therefore, parameters to represent these ingredients
were calculated from centered finite differences for which the distance
interval was twice the grid distance or 130 km. In addition, a
nine-point Laplacian routine for $\nabla^2 A$ was used, where A is any
scalar. All parameters, along with their definition and source, are
shown in Table 8. Each group will be discussed separately.
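
A minimal sketch of the two differencing operations follows. The centered difference uses the 2 x 65 km = 130 km interval stated above. The text does not give the weights of the nine-point Laplacian, so the stencil below (a common nine-point form) is an assumption, as are the function names and the convention that x runs along columns and y along rows.

```python
import numpy as np

GRID_KM = 65.0  # grid spacing given in the text

def centered_gradient(field, h=GRID_KM):
    """Centered finite differences over an interval of 2*h.
    Points where the difference cannot be formed are left as NaN."""
    f = np.asarray(field, dtype=float)
    dfdx = np.full(f.shape, np.nan)
    dfdy = np.full(f.shape, np.nan)
    dfdx[:, 1:-1] = (f[:, 2:] - f[:, :-2]) / (2.0 * h)
    dfdy[1:-1, :] = (f[2:, :] - f[:-2, :]) / (2.0 * h)
    return dfdx, dfdy

def laplacian_9pt(field, h=GRID_KM):
    """Nine-point Laplacian with weights 4 (edge neighbors), 1 (corner
    neighbors), and -20 (center), divided by 6*h**2. The choice of
    stencil is an assumption, not taken from the source."""
    f = np.asarray(field, dtype=float)
    lap = np.full(f.shape, np.nan)
    center = f[1:-1, 1:-1]
    edges = f[:-2, 1:-1] + f[2:, 1:-1] + f[1:-1, :-2] + f[1:-1, 2:]
    corners = f[:-2, :-2] + f[:-2, 2:] + f[2:, :-2] + f[2:, 2:]
    lap[1:-1, 1:-1] = (4.0 * edges + corners - 20.0 * center) / (6.0 * h * h)
    return lap
```

Both operators are exact for fields that vary quadratically in x and y, which makes them easy to check against a known analytic field before applying them to analyzed data.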

1)  Moisture 

The first set of moisture variables includes the equivalent
potential temperature ($\theta_e$) at several levels in the atmosphere,
its time change, gradient magnitude, and advection. This parameter has
been used for many years as a means of identifying air samples owing
to its conservative properties for both dry and saturated adiabatic
processes. It has been used recently in conjunction with the location
of the thunderstorm updraft (see, for example, Ellrod and Marwitz, 1976;
Fankhauser, 1974; Brandes, 1977). High values of $\theta_e$ represent a

^8 The data network restricts the horizontal scales to 300 to 1500 km.
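Table 8.  Candidate predictors
(a)  Moisture Parameters

Symbol                                  Definition                                                                          Source     Time (GMT)

$\theta_e$                              $\theta \exp(L W_s / C_p T)$                                                        Surface    1800
$\theta_{e8}$                           same except 850 mb values                                                           Upper air  1200
$\theta_{e7}$                           same except 700 mb values                                                           Upper air  1200
$\partial\theta_e/\partial t$           $\theta_e$(1800 GMT) - $\theta_e$(1500 GMT)                                         Surface    1500, 1800
$\partial^2\theta_e/\partial t^2$       $\partial\theta_e/\partial t$ - [$\theta_e$(1500 GMT) - $\theta_e$(1200 GMT)]       Surface    1200, 1500, 1800
$|\nabla\theta_e|$                      $[(\partial\theta_e/\partial x)^2 + (\partial\theta_e/\partial y)^2]^{1/2}$         Surface    1800
$\theta_e A$                            $-\mathbf{V}\cdot\nabla\theta_e$                                                    Surface    1800
$T - T_d$                               $T - T_d$                                                                           Surface    1800
$(T - T_d)_8$                           same except 850 mb values                                                           Upper air  1200
$(T - T_d)_7$                           same except 700 mb values                                                           Upper air  1200
$W$                                     $0.622e/(P - e)$                                                                    Surface    1800
$W_8$                                   same except for 850 mb                                                              Upper air  1200
$W_7$                                   same except for 700 mb                                                              Upper air  1200
$|\nabla W|$                            $[(\partial W/\partial x)^2 + (\partial W/\partial y)^2]^{1/2}$                     Surface    1800
$\nabla^2 W$                            $\partial^2 W/\partial x^2 + \partial^2 W/\partial y^2$                             Surface    1800
MDIV                                    $\nabla\cdot W\mathbf{V}$                                                           Surface    1800
$|D|$                                   $[(\frac{\partial W}{\partial x}\frac{\partial u}{\partial y} + \frac{\partial W}{\partial y}\frac{\partial v}{\partial y})^2 + (\frac{\partial W}{\partial x}\frac{\partial u}{\partial x} + \frac{\partial W}{\partial y}\frac{\partial v}{\partial x})^2]^{1/2}$   Surface    1800

In the definition of $W$, $P = -1013.25 + 1013.25/(1.0 - a(z))^b$ + ALTSTG,
where $a(z) = 0.0065z/288.0$ and $b = 5.246$.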
Table 8.  Candidate predictors (Continued)
(b)  Stability

Symbol   Definition                                                               Source     Time (GMT)

DTA      $(-\mathbf{V}\cdot\nabla_p T)_8 - (-\mathbf{V}\cdot\nabla_p T)_5$        Upper air  1200
CSIL     $\theta_{e7} - \theta_{e8}$                                              Upper air  1200
CSIM     $\theta_{e5} - \theta_{e8}$                                              Upper air  1200
KI       $T_8 + T_{d8} - (T - T_d)_7 - T_5$                                       Upper air  1200
TTI      $2(T_8 - T_5) - (T - T_d)_8$                                             Upper air  1200
STSI     [definition illegible]                                                   Upper air  1200
UWSH     $u_5 - u_8$                                                              Upper air  1200
DTK      $(Z_8 - Z_7)/150 - (Z_7 - Z_5)/200$                                      Upper air  1200
LTHA     $\nabla_p^2(-\mathbf{V}\cdot\nabla_p(\mathrm{DTK}))$                     Upper air  1200
$T$      Temperature                                                              Surface    1800
$\theta$ $T(1000/P)^{\kappa}$, $\kappa = R/C_p$                                   Surface    1800
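Table 8.  Candidate predictors (Continued)
(c)  Trigger

Symbol                               Definition                                                                        Source     Time (GMT)

$\omega_{TS}$                        $(-\mathbf{V}\cdot\nabla Z) + (\nabla\cdot\mathbf{V})$ [subscripts illegible]     Surface    1800
$\nabla^2 P$                         $\partial^2 P/\partial x^2 + \partial^2 P/\partial y^2$                           Surface    1800
$\zeta$                              $\partial v/\partial x - \partial u/\partial y$                                   Surface    1800
DVA                                  [definition illegible]                                                            Upper air  1200
IDIV                                 $1.25(\nabla_p\cdot\mathbf{V}) + 1.75(\nabla_p\cdot\mathbf{V}) + 1.0(\nabla_p\cdot\mathbf{V})$, one term per level [level subscripts illegible]   Upper air  1200
IMDIV                                sum of $(\nabla_p\cdot W\mathbf{V})$ over three levels [level subscripts illegible]   Upper air  1200
$|\mathbf{V}_5|$                     $(u_5^2 + v_5^2)^{1/2}$                                                           Upper air  1200
$v_5$                                500 mb N-S wind component                                                         Upper air  1200
VSUM                                 $v_5 + v_8$                                                                       Upper air  1200
$|\nabla P|$                         $[(\partial P/\partial x)^2 + (\partial P/\partial y)^2]^{1/2}$                   Surface    1800
$-\mathbf{V}\cdot\nabla P$           $-u\,\partial P/\partial x - v\,\partial P/\partial y$                            Surface    1800
$\partial(\nabla^2 P)/\partial t$    $\nabla^2 P$(1800 GMT) - $\nabla^2 P$(1500 GMT)                                   Surface    1500, 1800
MDRP                                 MDR code > 1 at 1700 or 1800 GMT                                                  Radar      1700, 1800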
potential, latent energy source (warm moist air) for the convective
process.

Next, basic measures of low-level relative humidity are included.
These are expressed as dew point depressions. The last group of
moisture parameters are basic measures of atmospheric water vapor content.
The mixing ratio has been combined with the divergence field of surface
wind in $\nabla\cdot W\mathbf{V}$ so that moisture advection and convergence
are included in a single term. This has been a leading predictor in other
studies (Charba, 1977; Alaka et al., 1973; Henz, 1974). The Laplacian of
mixing ratio identifies centers of high moisture (negative Laplacian). A
term which combines both the deformation field of the wind and surface
moisture pattern has been introduced in $|D|$. This is similar in form
to the frontogenetic function of Petterssen (1956, p. 201) with $\theta$
replaced by $W$ and is discussed in Palmen and Newton (1969, p. 246). It
is a way of locating where shear and confluence of surface wind could
concentrate moisture. Fig. 7 shows schematically how this might be
accomplished. Prime quantities represent isolines after a time
increment $\Delta t$. The lines of $W$ in Fig. 7a have been shifted to the
left for

convenience. Consider the magnitude of $\nabla W$. As [wind-derivative
term illegible] decreases, $|\nabla W|$ increases. Similarly, as
[derivative of $v$; denominator illegible] decreases, $|\nabla W|$
increases.
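
Two of the moisture parameters in Table 8 are simple enough to sketch directly: the mixing ratio, W = 0.622e/(P - e), and the surface moisture divergence MDIV, the divergence of WV. The sketch below assumes centered differences on the 65-km grid with x along columns and y along rows; the function names and array layout are illustrative only.

```python
import numpy as np

def mixing_ratio(e, p):
    """W = 0.622 e / (P - e), with vapor pressure e and pressure P
    in the same units."""
    e = np.asarray(e, dtype=float)
    p = np.asarray(p, dtype=float)
    return 0.622 * e / (p - e)

def moisture_divergence(w, u, v, h=65.0):
    """Centered-difference estimate of div(W V) at interior grid points.
    Boundary points are returned as NaN."""
    wu = np.asarray(w, dtype=float) * np.asarray(u, dtype=float)
    wv = np.asarray(w, dtype=float) * np.asarray(v, dtype=float)
    out = np.full(wu.shape, np.nan)
    d_wu_dx = wu[1:-1, 2:] - wu[1:-1, :-2]
    d_wv_dy = wv[2:, 1:-1] - wv[:-2, 1:-1]
    out[1:-1, 1:-1] = (d_wu_dx + d_wv_dy) / (2.0 * h)
    return out
```

Because the product WV is differenced as a whole, both moisture advection and wind convergence contribute to the result, which is the sense in which the text says they are included in a single term.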

2)  Stability 

There are numerous ways of estimating atmospheric static stability.
Differential temperature advection, where cold air is advected over warm
air or vice versa, is a way of incorporating kinematics (wind structure)
and time. So long as the advection is constant with cold advection
above warm advection, the atmosphere will respond by decreasing stability
with time. Similarly, the horizontal temperature gradient is related
to the vertical wind shear, and differential temperature advection will
be reflected in adjustment of the thickness field. Both wind shear
and thickness differences have been included. The Laplacian of
thickness advection should be a way of locating centers of strong
differential temperature changes which, in a subsequent time interval,
could be related to thunderstorm development. Convective instability is
important to thunderstorm development (Koch, 1975). This type exists in
the atmosphere in those layers where $\theta_e$ decreases with height.
There are three parameters in which a finite difference version of this
term is included. The last of these is static stability discussed by
Paine and Kaplan (1974). Finally, standard parcel stability measures and
surface values of temperature and potential temperature were used.
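
As an illustration of the parcel stability measures, the K index and Total-Totals Index can be written from their definitions in Table 8(b), with subscripts 8, 7, and 5 denoting 850, 700, and 500 mb. The function and argument names below are hypothetical, and temperatures are assumed to be in degrees Celsius.

```python
def k_index(t8, td8, t7, td7, t5):
    """K index: T8 + Td8 - (T - Td)7 - T5."""
    return t8 + td8 - (t7 - td7) - t5

def total_totals(t8, td8, t5):
    """Total-Totals Index written as 2(T8 - T5) - (T - Td)8, which
    equals (T8 - T5) + (Td8 - T5)."""
    return 2.0 * (t8 - t5) - (t8 - td8)

def convective_instability(theta_e_upper, theta_e_lower):
    """Finite-difference measure of the change of theta_e with height;
    negative values indicate theta_e decreasing upward."""
    return theta_e_upper - theta_e_lower
```

For example, with T8 = 20, Td8 = 15, T7 = 8, Td7 = 4, and T5 = -12, the K index is 43 and the Total-Totals Index is 59.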

3)  Trigger  mechanism 

Many days occur when sufficient moisture and instability are both
present and yet there are no thunderstorms. A trigger mechanism is
needed to release the instability and latent energy. Usually, this
trigger is manifested in vertical motion, so that we need to find a
lifting mechanism. Terrain-induced vertical motion is included as a
predictor combined with surface velocity divergence. The vorticity
field at the surface, measured by the vertical component of the curl of
the surface wind field or indirectly through the pressure Laplacian, is
another potential uplift mechanism through the convergence which it
induces. Fronts are frequently associated with thunderstorms. A front
can be identified through the wind, temperature, moisture, and pressure
fields. Temperature, moisture, and pressure gradients and the advections
of $\theta_e$ and $P$ were included as parameters. Measures of vertical
motion can be obtained in only a crude way from data at just five levels
in the atmosphere. Both integrated divergence (sums of divergence for
three levels) and integrated moisture divergence were included as
predictor parameters. Differential vorticity advection (DVA) is
included as a parameter since it and the Laplacian of thickness
advection are the two terms in the $\omega$-equation (Holton, 1972). The
meridional wind component at 500 mb is a measure of the strength and/or
proximity of an approaching trough if a general west to east wave
motion exists. Vorticity advection and vertical motion usually ensue.
The v-component sum at 850 and 500 mb measures the degree to which the
wind is in phase at these two levels east of a trough. The more out of
phase, the lower this sum would be; therefore, one would be looking at
a measure of the baroclinity of the lower atmosphere. A negative
correlation of this parameter measured at 1200 GMT with thunderstorms
later in the day would be expected. Finally, an increased tendency for
cyclogenesis at the surface may be associated with general uplift and,
therefore, a trigger mechanism for subsequent thunderstorms. The time
change of the Laplacian of the surface pressure,
$\partial(\nabla^2 P)/\partial t$, is one such indicator.

The last trigger shown in Table 8 is a binary radar parameter. Any
MDR code (two or greater) during the time period 1700 to 1800 GMT for
each grid point was coded as one; otherwise, zero was assigned. In this
way a one represents any precipitation occurring near the time the
forecast is to be made. Usually, when other conditions are right, any
precipitation at this time of the morning either maintains its intensity
by propagating within the predictand area when the code is already
greater than four, or develops into a thunderstorm in the subsequent
2 to 5 h. This predictor is the only direct measure of vertical motion
or trigger among all predictor parameters. Of course, some of the
parameters could contribute to more than one condition for thunderstorms.
Consider, for example, the discontinuity function, $|D|$; while listed
under moisture, it might also be discussed in conjunction with the dry
line and frontogenesis or a trigger term. Similarly, the Laplacian
of thickness advection is a term in both the $\omega$-equation and
Petterssen's development of surface vorticity tendency. It could be shown
with the trigger terms as well.
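
Two of the trigger terms lend themselves to a short sketch: the surface relative vorticity, taken as the vertical component of the curl of the wind, and the binary radar predictor MDRP. The centered differencing, the grid orientation (x along columns, y along rows), and the function names are assumptions made for illustration.

```python
import numpy as np

def relative_vorticity(u, v, h=65.0):
    """zeta = dv/dx - du/dy by centered differences at interior points.
    Boundary points are returned as NaN."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    zeta = np.full(u.shape, np.nan)
    dvdx = (v[1:-1, 2:] - v[1:-1, :-2]) / (2.0 * h)
    dudy = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2.0 * h)
    zeta[1:-1, 1:-1] = dvdx - dudy
    return zeta

def mdrp(code_1700, code_1800):
    """Binary radar predictor: 1 where the MDR code is two or greater
    at 1700 or 1800 GMT, else 0."""
    c17 = np.asarray(code_1700)
    c18 = np.asarray(code_1800)
    return ((c17 >= 2) | (c18 >= 2)).astype(int)
```

A solid-body rotation (u = -y, v = x) gives a vorticity of 2 everywhere, which provides a quick check on the sign convention.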

c.  Subdivision  of  original  data 

The total data set consists of parameters calculated at each of
256 grid points for 240 days. However, for reasons discussed in Section
3, not every grid point was used. There were a total of 7680
observations possible in the data set used for subsequent statistical
analysis. However, an observation which contained any missing element
was not used. The data were then subdivided into groups as shown in
Fig. 8.

1)  Developmental  and  test 

Subdivision of the original data set into developmental and test
groups was necessary so that some type of quality measure or
verification could be obtained. Every third day is considered to be
independent for temperature (Panofsky and Brier, 1958). Therefore, data
in every third day (day one, day four, day seven, ...) were used as a
test sample. The developmental sample included data in all other days.
As far as thunderstorms were concerned, the assumption of independence
was examined for a few grid points in the test sample. Ones were not
observed for two consecutive periods (day one and day four, for example).
The developmental sample was, therefore, considered to be the dependent
sample; the test data set was the independent sample. Equations were
developed from statistical models applied to the developmental sample
and tested on the test sample.

Fig. 8.  Subdivision of total data set.
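
The day-based split described above (days 1, 4, 7, ... held out as the test sample) can be sketched as follows, assuming each observation carries a day number counted from one. The function name is hypothetical.

```python
import numpy as np

def split_by_day(day_numbers):
    """Return boolean masks (developmental, test). Every third day,
    starting with day one, goes to the test sample; all other days go
    to the developmental sample."""
    day = np.asarray(day_numbers)
    test = (day - 1) % 3 == 0
    return ~test, test
```

Applied to 240 days, this rule places 80 days in the test sample and 160 in the developmental sample.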

2)  North  wind  and  south  wind 

Thunderstorms are observed to develop and behave somewhat differently
in different types of synoptic situations or in different air masses
(Purdom, 1975). Subdivision by air mass may give better results in this
type of statistical analysis where the sample includes several
thunderstorm seasons for a large area. Partitioning by air mass was not
directly possible with the historical data available. However, a
division of data by surface wind component at 1800 GMT was considered
to be a fair substitute. Consequently, the data were divided into
north wind and south wind sets according to whether the surface wind
had a northerly or a southerly component at 1800 GMT.
Separate regression analyses were performed on each subset.

3)  April  and  July 

All days and observation points in April for the two years of data
were combined. The same was done for July. Again, analyses were
performed within each data set to determine differences, if any, in
spring and summer predictors.

4)  Random  sample 

Samples were chosen by random-number generators so that
developmental samples contained nearly the same number of occurrences and
nonoccurrences. The unequal natural frequencies of thunderstorms
versus no thunderstorms create problems in regression analysis when the
dependent variable is binary. These problems were discussed in Section
2. Results of application of the various statistical techniques outlined
in Section 2 to subsets discussed here are presented next.
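
One way to build such a random sample is to keep every occurrence and draw a random fraction of the nonoccurrences. The sketch below is an assumption about the mechanics, not the original procedure; the function name, the seed argument, and the use of NumPy's random generator are all illustrative.

```python
import numpy as np

def balance_sample(y, keep_fraction, seed=0):
    """Return sorted indices of all occurrences (y == 1) plus a random
    subset of the nonoccurrences (y == 0), drawn without replacement."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    occurrences = np.flatnonzero(y == 1)
    nonoccurrences = np.flatnonzero(y == 0)
    n_keep = int(round(keep_fraction * nonoccurrences.size))
    kept = rng.choice(nonoccurrences, size=n_keep, replace=False)
    return np.sort(np.concatenate([occurrences, kept]))
```

The smaller the retained fraction, the closer the occurrence frequency in the sample moves toward one half, at the cost of a smaller total sample.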
5.  RESULTS

First, comparisons of results for different predictand times will
be presented. The types of predictors selected and the order in which
they were included in the model will be discussed next for all data
subdivisions. Following this will be the importance of surface versus
upper-air parameters to the prediction of thunderstorms. A discussion
of the maximum $R^2$ or variance reduction achieved will follow. Next,
performance of the equations applied to an independent data set will
be presented followed by comparisons with results of other investigators.
Results of a principal component analysis will be presented next.
Last will be a discussion of the utility of these equations in an
operational environment.

a.  Forecast  time  intervals 

Regression models were tested with fixed numbers of independent
variables and three time combinations of the dependent variable:
1900-2000 GMT, 2000-2300 GMT, and 2300-0200 GMT. Random samples were
chosen so that p was nearly the same. The $R^2$ decreased, as expected,
when the time interval between observations and forecasts lengthened;
however, the occurrence frequency of the predictand was only 9.3% in the
first period. With so few occurrences, this equation would likely
deteriorate^9 when applied to independent data. In other words, there

^9 "Deteriorate" means that probabilities of thunderstorms produced
by the linear equation developed from data in a dependent or
developmental sample would not correspond well with observed frequencies
of occurrence when these equations are used on an independent sample of
data.
would not have been enough different thunderstorm-producing
environments included in the sample. Also, extrapolation of existing
radar echo patterns would seem to be a more promising technique for those
1- to 2-h forecasts. Similarly, it is not likely that observed features
of the atmosphere early in the morning would adequately reflect
ingredients for the occurrence of thunderstorms late in the afternoon.
Consequently, only the 2000-2300 GMT period was included in all further
analyses.

b.  Predictor  selections 

The order of selection and specific predictors selected by a
stepwise, variable selection technique are shown in Table 9 for different
groups of data. Only the first six of many predictors offered as
candidates are shown. No matter how the data are divided, the first three
variables consistently selected include a combination of moisture and
trigger terms; the next several invariably include a measure of
atmospheric instability through either stability indices or linear
combinations of vertical temperature and moisture parameters. The first
four-to-five variables include all the synoptic-scale conditions for
intense convection. Therefore, it is not surprising that more than
85% of the total variance explained by the regression model is accounted
for by the first five variables.
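
The idea of selecting predictors one at a time by their contribution to $R^2$ can be sketched with a plain forward-selection loop. This is a simplified stand-in: a full stepwise procedure also applies significance tests for entering and removing variables, which are omitted here, and the function names are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X
    plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dev = y - y.mean()
    return 1.0 - (resid @ resid) / (dev @ dev)

def forward_select(X, y, n_select):
    """At each step add the candidate column that raises R^2 the most.
    Returns the chosen column indices and the R^2 after each step."""
    chosen, history = [], []
    for _ in range(n_select):
        best_j, best_r2 = None, -np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            r2 = r_squared(X[:, chosen + [j]], y)
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        chosen.append(best_j)
        history.append(best_r2)
    return chosen, history
```

The running $R^2$ values returned by the loop show directly what share of the final explained variance is already reached after the first few predictors.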

In the case of the north wind and all other subsets except south
wind, the single most important predictor was the surface mixing ratio.
The presence of precipitation (MDRP) near 1800 GMT was most important
for the south wind data. In the area chosen for this study, a south
wind implies the presence of maritime tropical air which contains
considerable moisture. Therefore, a trigger mechanism identified by
MDRP would be an important parameter contributing to thunderstorms,
given that moisture is already present.

For the April and July subsets the first three predictors are the
same. During April, a 500-mb trough (v-wind component at 500 mb) and
concentration of moisture gradient at the surface through the
deformation field of the wind are the next most important parameters. This
latter predictor can be interpreted to represent the location of the
surface dry line which is recognized as a favored region for severe
weather (Miller, 1972). In the spring, surface winds are stronger and
gradients more intense than in summer. Therefore, one would expect
these quantities to be reflected more in the synoptic data which are
utilized. During July, stability measured by the Total-Totals Index is
the fourth predictor chosen. This development is reasonable owing to
the weaker winds in the summer.

In a separate analysis, four different time changes were computed
for five surface variables. These were the 1-h, 3-h, 6-h and 3-h change
in the 3-h time change for the following: $\theta_e$, MDIV,
$\omega_{TS}$, $\theta_e A$, and [symbol illegible].
When these were used as candidate predictors in the stepwise
selection procedure, they were not chosen among the top five predictors.
Also, when time derivatives were selected, the 3-h and 6-h changes were
chosen before 1-h changes. One possibility for this result is that the
original spacing of surface data and analysis procedures restricts the
amplitudes of resolvable features. Six-hour features are more likely
to have the larger amplitudes which can trigger intense convection later
in the afternoon. More work needs to be done in this area.
The signs of regression coefficients are as expected when other
variables are included in the model. For example, the sign of the
temperature coefficient is interpreted as the change in predictand for
a unit change in temperature while holding constant all other variables
in the model at that time. The negative sign indicates that given
that surface moisture (among other things) already is present, then
thunderstorms occur with lower temperatures or when the air is more
nearly saturated. The total correlation coefficient for temperature
shown in Table 10 indicates a positive correlation of temperature and
thunderstorms when all other variables are ignored.

c.  Importance  of  surface  versus  upper-air  parameters 

Regression models were utilized with stepwise procedures for
surface variables and upper-air variables separately. Results are
summarized in Table 11. Surface parameters alone in linear combination
accounted for 15% of the total variance ($R^2 = 0.150$), whereas upper-air
parameters accounted for only 13.4%. When both sets were used
together, however, the best results were obtained; $R^2$ improved to 0.197.
Both timeliness and spatial resolution contributed to this result.
Surface data were available at 1800 GMT, 2-5 h before thunderstorm
occurrence, as opposed to upper-air observations at 1200 GMT. Also,
surface stations are spaced about 120 km apart compared to 370 km for
upper-air reports. Space derivatives, which are used extensively as
parameters, are, therefore, more nearly represented by finite
differences in the case of the former. Even though the upper-air predictors
were old and contained poor spatial resolution, when combined with
surface predictors, they produced a 30% improvement in $R^2$. It appears
Table 10.  Linear correlation coefficients of selected predictors
with the occurrence of thunderstorms during the
period 2000-2300 GMT.

Predictor          Time of            Correlation   Significance
                   Observation (GMT)  Coefficient   Probability level

$\theta_e$         1800                0.280        0.0001
$\omega_{TS}$      1800               -0.154        0.0001
MDIV               1800                0.173        0.0001
$\theta_e A$       1800                0.103        0.0001
LP                 1800                0.079        0.0001
$\zeta$            1800                0.099        0.0001
[illegible]        1800                0.050        0.0001
[illegible]        1800                0.041        0.0006
[illegible]        1800               -0.082        0.0001
CSIM               1200               -0.275        0.0001
CSIL               1200               -0.230        0.0001
KI                 1200                0.289        0.0001
TTI                1200                0.251        0.0001
STSI               1200               -0.274        0.0001
UWSH               1200               -0.130        0.0001
DVA                1200                0.008        0.5173
LTHA               1200                0.052        0.0001
DTA                1200                0.051        0.0001
IDIV               1200               -0.040        0.0008
IMDIV              1200               -0.041        0.0006
$|\mathbf{V}_5|$   1200               -0.128        0.0001
DTK                1200               -0.016        0.0001
$\theta_{e8}$      1200                0.283        0.0001
$\theta_{e7}$      1200                0.220        0.0001
[illegible]        1200                0.311        0.0001
$W_7$              1200                0.233        0.0001
$(T - T_d)_7$      1200               -0.174        0.000
[illegible]        1200               -0.213        0.0001
$v_5$              1200                0.090        0.0001
$W$                1800                0.328        0.0001
VSUM               1200                0.113        0.0001
U                  1800               -0.043        0.0001
V                  1800                0.050        0.0001
T                  1800                0.166        0.0001
$T - T_d$          1800               -0.190        0.0001
MDRP               1735                0.324        0.0001
Table 11.  Summary of statistics for regression analyses with
surface and upper-air predictors.

Total    Occurrence  Data                         Max $R^2$  Number      Mean
Sample   Frequency                                           of          Squared
Size     (%)                                                 Predictors  Error

7492     17.9        Surface                      0.150      11          0.125
7492     17.9        Upper air                    0.134      16          0.128
7492     17.9        Surface and upper air        0.197      24          0.118
7125     17.9        Upper air, surface and MDRP  0.243      20          0.114

that, poor as they are, these predictors fill an important gap in
identifying those observed features of the atmosphere which are
subsequently related to intense convection. Surface data alone give little
indication of the potential stability of the atmosphere. It is this
ingredient which is added by including upper-air parameters. The dew-point
depression at the 700-mb level is the first upper-air predictor included
by the stepwise procedure. Also, it is the third parameter following
low-level moisture and moisture divergence. Stability alone, however,
gives inadequate information for predicting subsequent thunderstorms.
From Table 10 it is seen that the highest correlation coefficient
between the predictand and any single stability measure is 0.289 for the
K index. Several other variables such as $W$, the radar predictor (MDRP),
and equivalent potential temperature differences exhibit higher
correlations.
d.  Quality of fit of the regression model

While the conditions for thunderstorms are known with some
confidence as far as synoptic data are concerned, there is little
confidence in determining these conditions from the study data. For example,
stability can be obtained from the vertical structure of temperature
and moisture profiles. When a limited sample of these data at a few
fixed levels in the troposphere comprise our measures, only
approximations to the stability can be made. There are a number of these
approximations depending on levels, variable combinations, and physical
assumptions (parcel method, layer method, etc.). Similarly, the trigger
mechanism must be inferred since vertical motion, the usual trigger, is
not one of the observed variables. Finally, the parameters contributing
to many thunderstorm occurrences exist on a scale much smaller than we
can resolve with our data. Thunderstorms have been observed to occur
at boundaries and intersections of pressure discontinuities (gust fronts)
caused by previous cells (Purdom, 1974). Similarly, they have been
observed to develop in the afternoon in areas which were void of clouds
that morning (Weiss and Purdom, 1974). The influence of the sea breeze
is illustrated by the frequency distribution of thunderstorms along the
Gulf Coast and Florida (Scoggins, 1976). Small-scale convergence
induced by gravity waves (Wave CISK) appears to be important to intense
convection from theoretical considerations as well (Raymond, 1976).
Even diffusion in a two-constituent medium might be a trigger (Schaefer,
1975). A consequence of the foregoing discussion is reflected in the
overall low $R^2$ or relatively small amount of variance of thunderstorm
occurrences that can be explained by the linear combination of synoptic
parameters.

Table 12 contains the maximum $R^2$ for a specific number of predictors
in each data subset. The regression for the random sample of
no-thunderstorm observations produced the highest $R^2$, 0.332, most
likely because


Table 12.  Summary of statistics for regression analyses with
different data subsets.

Total    Occurrence  Data                        Max $R^2$  Number      Mean
Sample   Frequency                                          of          Squared
Size     (%)                                                Predictors  Error

7125     17.9        Total                       0.243      20          0.114
2203     40.7        Random 18% of no TSTM days  0.332      18          0.163
2376     13.9        North wind                  0.284      13          0.086
4750     20.6        South wind                  0.238      21          0.125
1837     8.1         April dependent             0.255      14          0.056
1759     25.4        July dependent              0.279      16          0.138

the total number of observations decreased. The equation for predicting
thunderstorms which developed between 2000 and 2300 GMT following
surface wind with a northerly component at 1800 GMT accounted for 28.4% of
the total variance, whereas the south wind equation accounted for 23.8%,
though some of this difference would be due to the larger occurrence
frequency in the south wind data. Further, the north wind equation did
its job with fewer predictors.

The $R^2$ for the April and July data are based on fewer observations,
but it is interesting to note that the $R^2$ for April (0.255) is lower
than that for July (0.279) even though the frequency of thunderstorm
occurrence is much higher in July.

The mean squared error (MSE) of the regression analyses continued
to be reduced as more variables were added to the model.  This
indicates that exact synoptic-scale measures of the conditions for
thunderstorms were not available, or that the parameters did not truly
represent these conditions.  This result is not surprising if one
considers the crudeness of our measures of atmospheric structure in
terms of limited horizontal and vertical resolution, the untimeliness
of the upper-air measurements (8-11 h before thunderstorm occurrence),
and limitations imposed by the specific observed variables from which
parameters were computed.

Another way to evaluate quality is to consider how well predicted
probabilities represent actual frequencies of occurrence of
thunderstorms.  Predicted probabilities in 10% increments were
generated for several different data subdivisions.  These are shown in
Fig. 9.  In general there was a slight tendency to overpredict the
observed probability at low probabilities and to underpredict for
probabilities above 0.6.  This seems to be consistent with our natural
bias in subjectively derived probabilities.  Underprediction at high
frequencies of occurrence can be explained by the decreased slope of
the regression plane owing to the many more non-occurrence
observations compared to thunderstorm occurrences (see paragraph e).
No explanation, however, is apparent for the overprediction.
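
The comparison of predicted probabilities with observed frequencies
described above can be sketched in a few lines.  This is only an
illustration under stated assumptions: the function name, the sample
values, and the choice to clamp out-of-range regression output to the
interval 0 to 1 are ours, not taken from the original analysis.

```python
import numpy as np

def reliability_table(p_pred, y_obs, n_bins=10):
    """Group forecasts into equal-width probability bins.

    Returns one tuple per non-empty bin:
    (bin lower edge, number of cases, mean forecast, observed frequency).
    """
    # Regression output can fall outside [0, 1]; clamp it first (assumption).
    p = np.clip(np.asarray(p_pred, dtype=float), 0.0, 1.0)
    y = np.asarray(y_obs, dtype=float)
    # Bin index 0..n_bins-1; a forecast of exactly 1.0 goes in the top bin.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    rows = []
    for k in range(n_bins):
        in_bin = idx == k
        if in_bin.any():
            rows.append((k / n_bins, int(in_bin.sum()),
                         float(p[in_bin].mean()), float(y[in_bin].mean())))
    return rows

# Hypothetical forecasts and binary observations (1 = thunderstorm).
p_example = [0.05, 0.08, 0.15, 0.25, 0.65, 0.72, 0.75]
y_example = [0, 0, 0, 1, 1, 1, 0]
rows = reliability_table(p_example, y_example)
for lower, n, mean_p, freq in rows:
    print(f"{lower:.1f}-{lower + 0.1:.1f}  n={n}  forecast={mean_p:.3f}  observed={freq:.3f}")
```

A bin in which the mean forecast exceeds the observed frequency
corresponds to overprediction; the reverse corresponds to
underprediction.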




e.  Performance on test data sample

A further measure of the quality or goodness of the regression
models is how well the equations perform on an independent data sample.
Equations developed from a dependent sample were applied to independent
data.  Furthermore, a threshold of predicted probability was chosen and
a contingency table of counts for predicted and observed yes and no
cases developed as follows:


                         Forecast

                   Yes      No       Sum

            Yes     A        B       S1
Observed    No      C        D       S2
            Sum     S3       S4      T

From such a table some typical discriminants can be examined, such as
the overall percent correct, [(A+D)/T]100; the percent of correctly
forecast observations of an occurrence, called prefigurance,
(A/S1)100; the percent of correctly observed forecasts of occurrence
(postagreement), (A/S3)100; Threat Score, A/(A+B+C) (Charba, 1977),
called critical success index by Donaldson et al. (1975); skill score,
[(A+D) - (S1S3 + S2S4)/T]/[T - (S1S3 + S2S4)/T], discussed by Brier
and Allen (1952); and V-score, V = (AD - BC)/[(A+B)(C+D)], presented
by Dobryshman (1972) and discussed by Woodcock (1976).  The threat
score, skill score, or percent correct cannot be interpreted to
measure relative merits of each subdivision of the original data
because each is a function of the observed probability of occurrence
(called trial conditions by Woodcock (1976)).  These probabilities
change as the threshold of predicted probabilities for classification
purposes changes.  Table 13 contains example contingency


Table 13.  Contingency tables for 25 no-thunderstorm forecasts shifted
to the yes forecast column in different observed proportions.
Threat score (TS) and skill score (SS) are also shown.

                                           Forecast
                               Observed    Yes     No      TS      SS

(a)  Original proportion         Yes        67     55     0.29    0.340
     (20% forecast yes)          No        108    656

(b)  10% proportion              Yes        70     52     0.280   0.318
     (3 yes; 22 no)              No        130    634

(c)  15% proportion              Yes        71     51      --     0.326
     (4 yes; 21 no)              No        129    635

(d)  20% proportion              Yes        72     50      --      --
     (5 yes; 20 no)              No        128    636

(e)  30% proportion              Yes        75     47      --     0.356
     (8 yes; 17 no)              No        125    639

(f)  49% proportion              Yes        79     43      --      --
     (12 yes; 13 no)             No        121    643
tables where a fixed number of no-thunderstorm forecasts (in this case
25) are shifted to the yes column for different proportions of observed
yes and no cases.  This is exactly what is done when the threshold
probability is lowered.  One can see that the threat score (TS) or
skill score (SS) exceeds the original values only after the proportion
within the observed categories exceeds the original forecast
probability.  They therefore appear to be unsuitable as a goodness
measure.  The overall percent correct also is not very meaningful
because of the many days when no thunderstorms occur.  The V-score is
least affected by trial conditions but also involves the No-No entry.
Therefore, our discussions will focus primarily on the prefigurance
and postagreement percentages.  Table 14 contains the above
discriminants for each data subdivision.
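
The verification measures discussed in this subsection can be computed
directly from the four cell counts.  The sketch below is illustrative
only: the function name is ours, the skill score is assumed to be the
chance-corrected (Heidke-type) form, and the counts are those legible
for panel (c) of Table 13 (A = 71, B = 51, C = 129, D = 635).

```python
def verification_scores(a, b, c, d):
    """Scores from a 2 x 2 contingency table.

    a: observed yes, forecast yes    b: observed yes, forecast no
    c: observed no,  forecast yes    d: observed no,  forecast no
    """
    t = a + b + c + d
    s1, s2 = a + b, c + d          # observed yes / observed no totals
    s3, s4 = a + c, b + d          # forecast yes / forecast no totals
    expected = (s1 * s3 + s2 * s4) / t   # number correct expected by chance
    return {
        "percent_correct": 100.0 * (a + d) / t,
        "prefigurance": 100.0 * a / s1,
        "postagreement": 100.0 * a / s3,
        "threat_score": a / (a + b + c),
        "skill_score": ((a + d) - expected) / (t - expected),
        "v_score": (a * d - b * c) / (s1 * s2),
    }

scores = verification_scores(71, 51, 129, 635)
for name, value in scores.items():
    print(f"{name:16s} {value:.3f}")
```

With these counts the skill score evaluates to about 0.326, which
agrees with the value shown for that panel.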

One can obtain an indication of the deterioration of the equations
by looking at the decrease in any of the discriminants but, in
particular, the V-score between the developmental and test samples.
For example, the mean V-score for the total developmental sample is
0.454.  The mean for the total test sample is 0.390.  The lower score
means poorer performance.

Thunderstorms appear to be more predictable from synoptic parameters
when the surface wind has a northerly component at 1800 GMT.  Such an
implication is indicated by the greater V-score, prefigurance, and
postagreement percentages for the north wind equation when tested on
the independent sample compared to similar statistics for either the
total equation or that for the south wind subdivision.  There are
several explanations.  First, thunderstorms frequently develop behind
a shallow surface cold front (north wind component) in the area of


Table 14.  Summary of contingency tables.
this study.  In these situations the storms are usually connected with
a synoptic-scale vertical motion field that results from positive
vorticity advection due to a short-wave trough aloft, given that
moisture and potential instability exist.  Storms also can develop
along a surface cold front which trails an active squall line.  In
these cases as well, the surface winds behind a southeastward moving
squall line are likely to have a northerly component 2-5 h before the
occurrence of the cold-front cells.  Finally, we can distinguish
between thunderstorms in continental air masses, where surface winds
are from the north, and maritime air masses with southerly winds.
Thunderstorms occurring in the maritime air are more frequently
classified as convective, air-mass thunderstorms (Beers, 1945).  The
trigger mechanism for releasing the instability usually present is
less detectable from synoptic data.  Mesoscale or even smaller
discontinuities may exist and contribute to the trigger.  These elude
detection from the data in this study.

When applied to a random dependent sample (in other words, how well
can the linear function discriminate between thunderstorms and no
thunderstorms within the dependent sample, which includes only 17% of
all no observations), the equation produced prefigurance and
postagreement percentages of 85 and 77, respectively, although the
overall percent correct was down to 76 (Table 14, p. 67).  The
deterioration when applied to a random independent sample was not
large.  For a threshold of 0.46, 76 and 80% were obtained for the
prefigurance and postagreement, respectively.  When the equation
developed from the random dependent sample (17% of no-thunderstorm
observations) was applied to the total independent sample (as opposed
to the random independent sample), the prefigurance-postagreement
percentages were not as high.

It is not clear why the equation from the random dependent sample
deteriorates so little when applied to the independent sample.  One
possible explanation could be the binary nature of the dependent
variable and the unequal distributions of occurrences and
nonoccurrences.  Figure 10 shows how the influence of the
nonoccurrence observations


Fig. 10.  Regression lines for data in Tables 3 and 4.


actually decreases the slope of the least squares estimate of a
regression line.  By eliminating the no's so that the proportions are
more nearly equal, the slope is increased.  This increase can be
visualized as an increase in the discriminating ability of the
independent variables.  As the slope of the regression line increases,
a small change in x would produce a large change in the predicted
probability (Y) if the linear function were to be used in a predictive
fashion.
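
The effect of the excess of nonoccurrences on the slope can be
reproduced with a small numerical experiment.  The data below are
synthetic and evenly spaced, chosen only so that about 18% of the cases
are occurrences; they are not the data of Tables 3 and 4, and all names
are ours.

```python
import numpy as np

def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with an intercept)."""
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# Synthetic predictor values (illustrative only).
x_no = np.linspace(-2.0, 2.0, 4100)    # nonoccurrences, y = 0
x_yes = np.linspace(-1.0, 3.0, 900)    # occurrences, y = 1, shifted upward

x_full = np.concatenate([x_no, x_yes])
y_full = np.concatenate([np.zeros(x_no.size), np.ones(x_yes.size)])

# Keep every fourth nonoccurrence so the two classes are nearly balanced.
x_no_thin = x_no[::4]
x_bal = np.concatenate([x_no_thin, x_yes])
y_bal = np.concatenate([np.zeros(x_no_thin.size), np.ones(x_yes.size)])

slope_full = ols_slope(x_full, y_full)
slope_bal = ols_slope(x_bal, y_bal)
print(f"slope with all nonoccurrences:    {slope_full:.3f}")
print(f"slope with thinned nonoccurrences: {slope_bal:.3f}")
```

The thinned sample yields the steeper line, in keeping with the
argument above.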

Several attempts were made to accomplish the same result by using
critical values of predictors.  These values were selected from
frequency distributions of the predictand and leading predictors.  One
such frequency distribution is shown in Fig. 11.  There a cut-off would
be 5 g kg⁻¹ for the surface mixing ratio.  Others were chosen similarly
and used in conjunction (logical and) and disjunction (logical or)
operations.  An example of the latter would be as follows:  If
W < 5 g kg⁻¹ or θ_e < 317 K or KI < -8 or W_7 < 5 g kg⁻¹, then delete
this observation (hopefully it will be a no-thunderstorm observation).
In fact, the above statement provided the best results which could be
obtained.  Frequency of occurrence was increased only 7% and R² changed
from 0.260 to 0.247 for a stepwise procedure.
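
The disjunction rule quoted above amounts to a logical mask over the
observations.  A minimal sketch follows; the function and argument
names are ours, the sample values are hypothetical, and the level to
which the second mixing-ratio test applies is an assumption because it
is not clearly legible in the source.

```python
import numpy as np

def keep_mask(w_sfc, theta_e, k_index, w_upper):
    """Return True for observations that survive the screening.

    An observation is deleted if ANY of the critical values is violated:
    surface mixing ratio < 5 g/kg, theta-e < 317 K, K index < -8, or a
    second mixing ratio (level assumed, see text) < 5 g/kg.
    """
    w_sfc, theta_e, k_index, w_upper = (
        np.asarray(v, dtype=float) for v in (w_sfc, theta_e, k_index, w_upper))
    delete = (w_sfc < 5.0) | (theta_e < 317.0) | (k_index < -8.0) | (w_upper < 5.0)
    return ~delete

# Hypothetical observations; only the first passes every test.
mask = keep_mask(w_sfc=[12.0, 4.0, 10.0, 9.0, 8.0],
                 theta_e=[335.0, 330.0, 310.0, 320.0, 320.0],
                 k_index=[25.0, 20.0, 10.0, -10.0, 0.0],
                 w_upper=[6.0, 6.0, 7.0, 6.0, 3.0])
print(mask)
```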

f.  Comparison with other results

Except for the work of Charba (1977) there are no other results
which are directly comparable.  Charba has published his results for a
similar statistical technique (step up) and for 2- to 6-h forecasts
of thunderstorms (defined similarly from MDR data).  His research area
includes most of the eastern United States, and the predictand area is
about 80 km on a side.  However, Charba used combinations of radar
observations at 1735 GMT, radar climatology, surface observations at
1500 GMT, and upper-air forecasts valid at 2100 GMT from a
limited-area, fine-mesh model (Howcroft and Desmarais, 1971) as
predictors.  If we exclude radar predictors, his top four were (1) a
modified* K index, (2) moisture divergence at the surface, (3) modified
Total-Totals Index, and (4) 500-mb wind speed.  These compare favorably
with the moisture, divergence, and stability parameters from
observations in this study.  The observed frequency of thunderstorms in
Charba's work was 10% compared to 17% here.  One should see an
increased R² from this influence in Charba's result counteracted to
some extent by a reduction in R² due to the smaller forecast area.  The
net result was an R² of 0.282 in Charba's scheme compared to 0.284 in
the case of our north-wind equation.  In addition, Charba's predicted
probabilities were similar overall to those in this research.

Some knowledge is required of how forecasters subjectively predict
thunderstorms in an operational environment.  Unfortunately, there are
no statistics which would exactly correspond to the areas, times, and
procedures used here.  In fact, any verifications of thunderstorm
forecasts with different lead times are difficult to find.  One set of
data was available for 14 base weather stations in or near the area of
Fig. 4 (p. 32) during the June, July, and August 1976 period.  These
data consist of warnings issued by forecasters of impending
thunderstorms.  The number issued and the number verified with a lead
time are summarized in Table 15.  Thunderstorms which occur less than
1 h from the forecast


* Modified in this context means that surface observations of
temperature and dew point at 1800 GMT were averaged with a forecast
temperature and dew point at 850 mb.
Table 15.  Contingency table of observed and forecast
thunderstorms for 14 base weather stations
near the area outlined in Fig. 4.


Forecast 


31.8% 


time are counted as misses.  For example, a warning for thunderstorms
issued at 1700 GMT valid for the period 1900 to 2300 GMT would be a
hit if a thunderstorm were observed at the station or within the base
environment after 1900 GMT.  Otherwise, it would be a miss.  The base
environment is usually about a 10-km radius of the station but may vary
up to 45 km.  This is still considerably smaller than the forecast
area of a square 138 km on a side used in this study and, therefore,
should reflect poorer performance.  On the other hand, a 1-h lead time
is allowed for verifying the weather warnings, whereas the lead time is
2 h for the statistics in Appendix B.  This may compensate to some
extent for the smaller area.  Many other differences exist between
these statistics and those presented in Appendix B so that comparisons
are difficult.  The very definition of thunderstorms is different.  An
MDR code of four or greater was used in this research.  The weather
stations used their observation logs or the radar in a qualitative
sense.  Also, the issue time from the weather station was not
constrained to 1800 GMT.  Finally, the period for the base weather
stations included a year, 1976, and a month, August, that were not
available in this study.  Nevertheless, the low prefigurance and
postagreement percentages of 46% and 32%, respectively, seem to be
typical of a forecaster's performance at this difficult task.

Though there is little confidence in comparisons of verification
measures applied to data of this nature, there is a combination of
encouraging signs which lead to a conclusion that observations of key
parameters in linear combination can provide useful forecasts of
thunderstorms in areas of about 8350 km² for periods of 2 to 5 h.
First, parameters selected by statistical methods provide the
ingredients for subsequent thunderstorms which have been deduced from
many years of experience.  Secondly, the equations do not deteriorate
when applied to independent samples.  Further, contingency tables
produced from equations for many different threshold probabilities
provided higher prefigurance and postagreement percentages than those
from a table of actual performance.  Also, predicted probabilities from
the equations represent actual occurrence frequencies.  Finally, these
results are very similar to those from an operational program where
forecast model predictors had been used.  Results of a principal
component analysis are discussed next.

g.  Dimensionality 

As stated earlier, eigenvectors of the independent-variable matrix
that consists of sums of squares and cross products, (X'X), can be
interpreted to represent the part of the total variance accounted for
by the given linear combination of variables where the eigenvector
elements are the weights or coefficients.  If it turns out that the
first few components account for some large percentage of the total
variance as shown by the cumulative portion of the eigenvalues, then
it can be assumed that there is evidence of the true dimensionality of
the original set of variables or that there is an indication of the
total number of hypothetical, latent variables needed to describe the
structure of the original variables.  This is another way of
quantifying the degree of intercorrelation among the x's.  These
eigenvectors for different subsets of the (X'X) matrix are shown in
Table 16.  Also shown are the associated eigenvalues and cumulative
portion of the total variance which is accounted for by each
successive eigenvector.

In the case of moisture parameters, we can account for nearly
90% of the total variance in all moisture parameters by using the first
five components (the five largest eigenvalues).  We can account for
50% of the total with just two.  The variables which seem to be most
important, according to the sum of the first two eigenvector
coefficients, are surface, 850-, and 700-mb mixing ratio, equivalent
potential temperature at the surface, and dew-point depression at 700
mb.  It is not surprising that among these are the leading parameters
selected by the stepwise regression procedure.

Stability parameters have fewer dimensions as shown by the
eigenvectors.  Just one principal component accounts for 59% of the
total variance.  The 90% point is reached with only four eigenvectors.
Among the first two components the important variables seem to be
equivalent potential temperature at 700 and 850 mb and the
Total-Totals Index.  If we consider all eigenvectors, the top five
parameters are θ_e5 - θ_e8, static stability index, θ_e8, Total-Totals
Index, and differential thickness (DTK).  All stability parameters are
highly intercorrelated and there really should not be many dimensions
when they are considered together.

Principal components for trigger parameters indicate that the
trigger mechanism is difficult to identify from these parameters.  The
cumulative variance does not reach 50% until the third eigenvector
(compared to first for stability and second for moisture) and 90% is
not reached until eigenvector seven (not shown in Table 16, p. 76, as
we stop at six eigenvectors).  Here, important parameters are vertical
motion at the top of the surface layer (this includes terrain-induced
vertical motion), surface divergence of moisture, and integrated
moisture divergence from 850 to 300 mb.

Finally,  all  predictor  parameters  can  be  considered  together.  This 
case  is  summarized  in  Table  17  where  only  eigenvalues  and  cumulative 
variance  are  shown.  With  five  principal  components  one  could  account 
for  50%  of  the  total  variance  among  all  parameters.  Seventeen  components 
could  account  for  92%.  Therefore,  it  seems  justifiable  to  use  at 
least  five  variables  in  discriminant  models  and  possibly  up  to  17.  The 
radar  predictor  was  not  included  in  this  analysis. 
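
The cumulative portions quoted for the all-parameter case follow from
the eigenvalues by a running sum.  The sketch below uses the 18
eigenvalues listed in Table 17; the total variance of 35 is our
assumption, inferred from the ratio of the first eigenvalue to its
cumulative portion (9.677/0.276), and the function name is ours.

```python
import numpy as np

# Leading 18 eigenvalues as listed in Table 17.
eigenvalues = np.array([9.677, 3.032, 2.698, 2.055, 2.012, 1.777,
                        1.336, 1.218, 1.149, 1.026, 0.997, 0.965,
                        0.918, 0.872, 0.864, 0.761, 0.734, 0.653])

TOTAL_VARIANCE = 35.0   # assumption: inferred from 9.677 / 0.276
cumulative = np.cumsum(eigenvalues) / TOTAL_VARIANCE

def components_needed(cumulative_portion, target):
    """Smallest number of leading components reaching the target portion."""
    if target > cumulative_portion[-1]:
        raise ValueError("target not reached by the listed components")
    return int(np.searchsorted(cumulative_portion, target)) + 1

print(np.round(cumulative, 3))
print(components_needed(cumulative, 0.50))   # components for 50%
print(components_needed(cumulative, 0.90))   # components for 90%
```

Under that assumption the running sum reproduces the tabulated
cumulative portions to three decimals, which supports both the
assumed total and the counts of five and seventeen components
mentioned above.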

h.  Operational utility

It  is  rather  fortuitous  for  individual  weather  station  application 
Table 17.  Eigenvalues and cumulative portion of total variance accounted
for by each successive eigenvector.

Eigenvector       1      2      3      4      5      6      7      8      9

Eigenvalue      9.677  3.032  2.698  2.055  2.012  1.777  1.336  1.218  1.149
Cumulative
Portion         0.276  0.363  0.440  0.499  0.556  0.607  0.645  0.680  0.713

Eigenvector      10     11     12     13     14     15     16     17     18

Eigenvalue      1.026  0.997  0.965  0.918  0.872  0.864  0.761  0.734  0.653
Cumulative
Portion         0.742  0.771  0.798  0.825  0.849  0.874  0.896  0.917  0.936

that none of the more complicated (from a computational standpoint)
parameters were chosen among the top few predictors.  In a
five-variable equation one would have only to evaluate the moisture
divergence term.  In order to do this, one needs to plot 1800 GMT
mixing ratios obtained from a skew-T diagram along with u and v wind
components.  A forecaster should extract values of (1) the product
u x W at two east-west grid points spaced 130 km apart, 65 km to either
side of his station, and (2) v x W at two similarly spaced north-south
grid points.
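
The hand calculation just described is a centered finite difference.
A minimal sketch follows; the function name and the sample winds and
mixing ratios are hypothetical, and any additional scaling applied to
this term in the original equations is not reproduced here.

```python
def moisture_divergence(uw_east, uw_west, vw_north, vw_south, spacing_m=1.3e5):
    """Centered-difference estimate of the divergence of (W * V).

    uw_east, uw_west:   product u*W at points 65 km east and west of the station
    vw_north, vw_south: product v*W at points 65 km north and south
    spacing_m:          distance between each pair of points (130 km)
    With W in g/kg and winds in m/s the result is in g/kg per second.
    """
    d_uw_dx = (uw_east - uw_west) / spacing_m
    d_vw_dy = (vw_north - vw_south) / spacing_m
    return d_uw_dx + d_vw_dy

# Hypothetical values: (wind component in m/s) * (mixing ratio in g/kg).
div = moisture_divergence(uw_east=5.0 * 10.0, uw_west=8.0 * 12.0,
                          vw_north=4.0 * 9.0, vw_south=10.0 * 13.0)
print(f"{div:.3e}")   # negative value means moisture convergence
```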

Negative predicted probabilities are possible but should be
considered as zero.  Similarly, probabilities greater than one should
be interpreted as one.  The probability threshold for a
thunderstorm-no-thunderstorm decision could be estimated from the 40%
postagreement percentages in the contingency tables from within the
dependent or total samples.  The best estimate for either the total,
north wind, or south wind equations is about 0.28.  This would optimize
prefigurance at the expense of "crying wolf" and total percent correct.
Of course, this cut-off would be shifted toward lower probabilities
when a critical (in terms of costs involved) task was involved.
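
The two rules in the preceding paragraph, limiting the regression
output to the range zero to one and converting it to a yes or no
forecast at a threshold near 0.28, can be written as follows.  The
function names are ours, and treating a value exactly equal to the
threshold as a yes forecast is an assumption.

```python
def to_probability(raw_value):
    """Clamp a raw regression estimate to the interval [0, 1]."""
    return min(1.0, max(0.0, raw_value))

def yes_no_forecast(raw_value, threshold=0.28):
    """Return True (forecast thunderstorm) when the clamped
    probability reaches the decision threshold."""
    return to_probability(raw_value) >= threshold

for raw in (-0.07, 0.12, 0.31, 1.12):
    print(raw, to_probability(raw), yes_no_forecast(raw))
```

Lowering `threshold` for a cost-critical task trades more false alarms
for fewer missed events.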

The probabilities can be used directly and the operator should be
encouraged to use these in conjunction with cost analyses.  If the
costs of protective action and loss potential for inaction are known,
then the occurrence probabilities can be used in objective cost-loss
algorithms (Murphy, 1976).
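
As an illustration of how a probability enters a cost-loss decision,
the sketch below uses the simplest static rule: protect when the
expected loss from inaction exceeds the cost of protection.  This is a
textbook form, not necessarily the formulation of the cited reference,
and the function names and monetary values are hypothetical.

```python
def should_protect(probability, cost, loss):
    """Protect when the expected loss (probability * loss) exceeds the cost.

    cost: expense of taking protective action
    loss: loss suffered if the event occurs and no action was taken
    """
    if cost < 0 or loss <= 0:
        raise ValueError("cost must be non-negative and loss positive")
    return probability * loss > cost

def expected_expense(probability, cost, loss, protect):
    """Expected expense of a decision under the same simple model."""
    return cost if protect else probability * loss

# Hypothetical example: protection costs 1000, an unprotected storm costs 10000,
# so the break-even probability is cost / loss = 0.10.
for p in (0.05, 0.30):
    act = should_protect(p, cost=1000.0, loss=10000.0)
    print(p, act, expected_expense(p, 1000.0, 10000.0, act))
```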

Operational equations should be developed in given areas with all
data available.  Since this study was undertaken, an additional year of
data has been collected.  New equations should incorporate all days for
which predictor-predictand samples are available and should be applied
to the subsequent year.  So long as a few (five or six) predictors are
used, weather station forecasters within the development area and for
the particular predictor-predictand times could use the equations
directly for estimating the probability of thunderstorms.  More
complicated equations which incorporate extensive analysis and
transformed predictors would be applied to current data at facilities
with computer processing capability.  Probabilities could be
transmitted to appropriate locations.  This latter procedure is
currently employed by the National Weather Service.  (See National
Weather Service Technical Procedures Bulletin 194.)

The following five-variable equations developed from the 1974-1975
sample can be tested with current data and probabilities evaluated:

Total PY = 0.0181 + 0.0185*W + 0.414*MDRP - 0.00278*(∇·WV)
           - 0.00569*(T-T_d)_7 - 0.00515*(θ_e7 - θ_e8)            (8)

North wind PY = 1.028 + 0.00337*W + 0.358*MDRP - 0.00336*(∇·WV)
                - 0.00374*(T-T_d)_7 - 0.00373*θ_e7                (9)

South wind PY = 0.0655 + 0.427*MDRP + 0.0194*W - 0.00265*(∇·WV)
                - 0.00583*(T-T_d)_8 - 0.00406*(T-T_d)_7          (10)

Coefficients from these equations are valid for the following units of
measure for predictors:  W (g kg⁻¹); MDRP (zero or one for no precip
or precip); V (m s⁻¹); T, θ (K); Δx = Δy = 1.3 x 10⁵ (m) in the
moisture divergence calculation.  Predicted probabilities would apply
to locations within the developmental area (Fig. 4, p. 32) and are
valid with 1800 GMT surface or 1200 GMT upper-air observations.
Thunderstorm probabilities (PY) would apply to the area shown in
Fig. 6 (p. 39) with respect to the forecasting station and during the
period 2000 to 2300 GMT.

Performance in terms of prefigurance and postagreement percentages
of a binary (yes or no) forecast could be expected to be slightly lower
than the 65% and 40%, respectively, obtained with equations containing
more than 15 predictors.
6.  UPPER-AIR CONDITIONS AT 3-h INTERVALS

On one day, 24 April 1975, upper-air data were available at 3-h
intervals.  These were collected as part of the Fourth Atmospheric
Variability Experiment (AVE IV) sponsored by the National Aeronautics
and Space Administration (NASA).  Analyzed fields of temperature,
height, dew point, and wind components from a 158-km grid spacing for
49 grid points and four levels were utilized in a test to determine
changes of correlations and predictors at different times with
occurrences of thunderstorms at 2000-2300 GMT.  Analysis procedures are
described by Fuelberg (1976).

Twenty-one candidate predictors were calculated for each grid point
at 1200 GMT, 1500 GMT, and 1800 GMT.  The predictand was the highest
MDR value (converted to binary) in an area equivalent to a 138-km box
surrounding the grid point, as in previous work, and for any time
during the period 2000 to 2300 GMT.  Again, variable selection
techniques were used to choose subsets of predictors.  Stepwise
procedures provided the first several predictors; all possible
regressions were considered in the selection of variables four through
six.  A stepdown or backward elimination procedure was used for those
models beyond six variables.  Separate regression analyses were
performed for each period, and the same candidate predictors as
discussed earlier were available to each.
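
The forward (step-up) part of such a variable-selection procedure can
be sketched as a greedy search on R².  This is a simplified
illustration, not the statistical package used in the study: it has no
significance test for entry or removal, the function names are ours,
and the demonstration data are random numbers.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, n_keep):
    """Greedy forward selection: at each step add the candidate column
    that gives the largest R^2 together with those already chosen."""
    chosen, history = [], []
    remaining = list(range(X.shape[1]))
    for _ in range(n_keep):
        best = max(remaining, key=lambda j: r_squared(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
        history.append(r_squared(X[:, chosen], y))
    return chosen, history

# Synthetic demonstration: 49 "grid points", 6 candidate predictors,
# binary predictand driven mainly by columns 2 and 4.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(49, 6))
y_demo = (X_demo[:, 2] + 0.5 * X_demo[:, 4]
          + 0.3 * rng.normal(size=49) > 0).astype(float)
chosen, history = forward_select(X_demo, y_demo, 3)
print(chosen, np.round(history, 3))
```

Because each step can only add explained variance, the R² history is
non-decreasing, which is why a curve of maximum R² against number of
predictors rises and then flattens.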

Maximum R² achieved for each model, from a one-variable model up to a
model with all 21 variables, is shown in Fig. 12.  As expected, most of
the explained variance was obtained with the first three variables.
What is surprising is that the 1800 GMT predictor time, which is closest
to the time for which the forecast is made, did not provide a clearly
Fig. 12.  Fractional amount of total variance in thunderstorm occurrence
accounted for by numbers of predictors and a combination of
selection procedures.

superior equation.  With models including the leading one and two
predictors, R² is highest for 1200 GMT and lowest for 1500 GMT though
the differences are slight.  Maximum R² of 0.874 was achieved for the
all-variable model with 1500 GMT data, whereas the maximum R² seems to
reach a plateau beyond ten variables for the 1800 GMT period.  No
completely satisfying explanation is apparent for the lack of
improvement as the predictand time is approached; however, there are
some possibilities.  As pointed out in Section 2, the assumptions
inherent in an analysis of this type are not fulfilled.  These errors
may be preventing the measure of true correlations.  Secondly, this was
one day for which there were only 49 observations, and many of these
were not independent.

On this day most of the thunderstorm activity was associated with
two squall lines.  As shown in Fig. 13, the first group of cells was
dissipating and moving southeastward between 1200 and 1500 GMT.  At
1800 GMT there were few echoes.  The second line became active after
2100 GMT.  One may hypothesize that there were different atmospheric
environments created by the occurrence or nonoccurrence of convection
at many of the 49 points for each time.  Similarly, a discontinuity
existed across the area in the form of a stationary front shown in
Fig. 14.  Such a feature complicates the interpretation of results for
all points as each is considered an independent, separate observation.
For example, temperature may be important to thunderstorm development
in the area behind the front (in the cool air), but its influence may
be masked by the many observations in the warm air where it may not
be important at all.  Finally, the response of the atmosphere to the
synoptic-scale parameters is being measured.  There may be different
response times for different parameters.  It is possible that those
upper-air features at 1800 GMT to which the atmosphere responds most
exist on a horizontal and vertical scale smaller than can be resolved
from our data.

Table 18 contains the predictors selected during each of the three
periods.  Up to the five-variable model all antecedent predictors are
included.  For example, the best five-variable model with 1200 GMT
data includes u-component wind shear, differential temperature
advection, static stability index, mid-level convective instability,
and the dew-point depression at 700 mb.  After five predictors
different variables are chosen, some of which were not selected up to
that point.  Again, the particular variables selected beyond five
should not really be discussed since these are undoubtedly more a
function of the particular selection technique than any physical
mechanism.

The first few variables included in the model can be discussed in
that these variables in linear combination are most highly correlated
to subsequent thunderstorms on 24 April 1975.  The difference between
the u-wind component at 500 and 850 mb is important at the earlier two
times.  This term is related to the mean horizontal temperature
gradient in the layer between 850 and 500 mb insofar as the winds are
geostrophic.  Differential temperature advection between 850 and 500 mb
is also an important term as it is among the top two predictors for all
times.  Temperature advection probably was an important mechanism for
creating the instability on this day.  It is interesting to note that
the u-component wind shear was the first variable selected for the
model at both 1200 and 1500 GMT observation times, whereas it is fourth
at 1800 GMT.  This may be a consequence of the environmental influence
of thunderstorms present at the earlier times but almost totally absent
at 1800 GMT.  From an energy study of this day, Fuelberg (1976) found
strong conversion of potential to kinetic energy associated with
intensifying convection.  The maximum conversion was at 400 mb.  A
selection of different variables measured at different times or the
same variables in different order could also be a result of differences
in atmospheric response to dynamic as opposed to thermodynamic
parameters.  More work needs to be done in this area.

In summary, the linear combination of upper-air parameters computed
from variables measured 2 to 5 h before the predictand time on 24 April
1975 did not explain more of the variance of thunderstorm occurrence at
2000 to 2300 GMT than those measured 8 to 13 h before. Also, there were
differences in the parameters selected at the different times.
Differential temperature advection was important at all times. The
vertical wind shear of the east-west wind component was less important
to subsequent intense convection when computed from 1800 GMT
measurements than when computed from 1200 or 1500 GMT measurements.
These results may be a consequence of environmental influences of
convection at the earlier two times, since little convection was
apparent at 1800 GMT. They also might result from violations of model
assumptions.

7.  SUMMARY AND CONCLUSIONS

Surface, upper-air, and radar observations analyzed onto a 65-km
grid were used exclusively to develop equations which relate predictors
to subsequent thunderstorms by classical statistical and
parameterization techniques. Particular attention was devoted to
minimizing errors which result from violations of model assumptions.
Raw data were processed to preserve as much detail as can be justified
from the original spacing of observing stations. Every fourth point
from a 16 x 16 array was included to reduce the spatial correlation
naturally present in meteorological data. Variable selection
techniques, plots of model residuals, and principal component analyses
were used to reduce the multicollinearity present among independent
variables. Finally, several different statistical procedures were used
to cross-check and confirm results.
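
The variable-selection step summarized above can be illustrated with a minimal sketch of greedy forward selection on a binary predictand. This is not the procedure used in the study, which relied on packaged stepwise and maximum-R² routines; the function names `r_squared` and `forward_select`, the synthetic predictors, and the absence of any significance test for entry are assumptions made purely for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / ss_total

def forward_select(X, y, max_vars):
    """At each step, add the candidate predictor that raises R^2 the most."""
    chosen, history = [], []
    remaining = list(range(X.shape[1]))
    while remaining and len(chosen) < max_vars:
        scores = [(r_squared(X[:, chosen + [j]], y), j) for j in remaining]
        best_r2, best_j = max(scores)
        chosen.append(best_j)
        remaining.remove(best_j)
        history.append(best_r2)
    return chosen, history

# Synthetic example (hypothetical data): six candidate predictors, of which
# only columns 2 and 4 influence the 0/1 predictand.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
latent = 1.0 * X[:, 2] - 0.5 * X[:, 4] + rng.normal(size=2000)
y = (latent > 1.0).astype(float)

chosen, history = forward_select(X, y, max_vars=3)
print("order of selection:", chosen)
print("R^2 after each step:", [round(v, 3) for v in history])
```

In a sketch of this kind the first predictors chosen carry most of the explained variance, while later choices add very little and depend mainly on sampling noise, which is consistent with the caution expressed above about interpreting variables selected beyond the first few.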

Specific synoptic parameters believed to be related to intense
convection were calculated from analyses at 1200, 1700, and 1800 GMT
and used as candidate predictors in a stepwise variable-selection
procedure. Surface and upper-air data were tested separately. The
predictand was the occurrence or nonoccurrence of an MDR code of four
or greater (assumed to represent thunderstorms) in an area of about
8500 km² surrounding a grid point during three subsequent time
combinations. The best time was the period from 2000 to 2300 GMT, so
only this combination was used in further analyses.
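
The definition of the predictand can be written as a short sketch. The array layout (one row per report time, one column per grid point), the use of hourly reports, and the name `thunderstorm_predictand` are assumptions for illustration; only the threshold of four comes from the text.

```python
import numpy as np

def thunderstorm_predictand(mdr_codes, threshold=4):
    """Return 1 where any MDR code in the time window reaches the threshold.

    mdr_codes: array of shape (n_times, n_grid_points) holding the MDR code
    reported for the area around each grid point at each time in the window.
    """
    codes = np.asarray(mdr_codes)
    return (codes.max(axis=0) >= threshold).astype(int)

# Hypothetical MDR codes for three grid points, four reports in 2000-2300 GMT.
window = [[0, 2, 1],
          [1, 4, 1],
          [0, 6, 3],
          [0, 3, 2]]
predictand = thunderstorm_predictand(window)
print(predictand)
```

Only the second grid point reaches a code of four or greater during the window, so it alone is counted as a thunderstorm occurrence.
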

The equations were found to be stable^^ when applied to test data.
Also, they contained reasonable parameters as predictors and produced
results in contingency tables comparable with present, subjective
techniques and with other statistical procedures. Predicted values
from developmental and test samples represented actual thunderstorm
frequencies of occurrence. This technique can be used to forecast
thunderstorms in an operational environment. Furthermore, thunderstorms
can be predicted with greater success with this scheme when the surface
wind has a northerly component at 1800 GMT.

While not impressive alone, upper-air data seemed to add an
important ingredient, namely stability, which is not available from
surface data. Radar echoes present at and before the forecast time also
added an important dimension. An MDR code greater than one near 1700 GMT
can lead to an MDR code of four or greater between 2000 and 2300 GMT due
to diurnal effects, or a high MDR code initially might tend to persist
in space and time. In any case, this radar predictor indicates the
presence of vertical motion, a recognized trigger mechanism. Neither
time nor space derivatives as computed in this study were particularly
important predictors, with the notable exception of moisture divergence.
But the surface mixing ratio, occurrence of antecedent precipitation,
convergence of moisture, and stability were chosen to be among the top
five predictors in every case. A reason for the poor showing of other
derivatives was that the small-scale gradients important to intense
convection cannot be measured due to data-resolution constraints from
fixed observation networks.

^^Stable in this context means that statistics in both the
developmental and test data samples are nearly the same.

It was found from both the stepwise procedure and principal-component
analyses that linear equations should include from five to 17
variables when parameters represent observed surface and upper-air
features. Furthermore, measures of the trigger mechanism were found
to be most difficult to define from data in this study, whereas moisture
parameters were easily defined.

Equations with many variables will produce slightly better results
in terms of prefigurance and postagreement discriminates. Reasonable
values to expect would be 65% and 40%, respectively.
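
As a worked example of these two scores, the sketch below uses the conventional definitions: prefigurance is the fraction of observed occurrences that were forecast, and postagreement is the fraction of forecast occurrences that were observed. The counts are those listed in the appendix for the Southwind independent data at a cutoff of 0.25. It is an assumption here that the rows of that table are observed categories and the columns forecast categories; the row totals stay fixed as the cutoff changes, which is consistent with that reading. The function names are illustrative.

```python
def prefigurance(hits, misses):
    """Fraction of observed occurrences that were correctly forecast."""
    return hits / (hits + misses)

def postagreement(hits, false_alarms):
    """Fraction of forecast occurrences that were actually observed."""
    return hits / (hits + false_alarms)

# Counts read from the appendix table (cutoff 0.25, independent data),
# assuming rows = observed and columns = forecast.
hits, misses = 238, 176
false_alarms, correct_negatives = 259, 1211

pre = prefigurance(hits, misses)
post = postagreement(hits, false_alarms)
print(f"prefigurance  = {pre:.1%}")
print(f"postagreement = {post:.1%}")
```

Under that reading the independent sample gives roughly 57% prefigurance and 48% postagreement at this cutoff, the same order as the values quoted above.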

Finally, parameters from upper-air observations at 1800 GMT on
24 April 1975 were not more highly correlated to thunderstorms in the
period 2000-2300 GMT than were parameters from observations at 1200 or
1500 GMT. This result may be a consequence of the small statistical
sample, violations of assumptions in the statistical analysis, and the
organized development and movement of two groups of thunderstorms. One
group influenced observations from which parameters were calculated at
1200 and 1500 GMT.

8.  SUGGESTIONS FOR FURTHER RESEARCH

In a study of this scope and magnitude there are practical
restrictions on the amount of data to be handled, numbers of predictors
used, and types of processing to be performed. It is believed that this
research remained within these constraints without sacrificing
scientific thoroughness and accuracy. Nevertheless, these limitations
and the results of the investigation itself provide several suggestions
for future research.

a) In order to capture some of the true mesoscale features of
the atmosphere, synchronous meteorological satellite data should be
used. Mesoscale wind fields determined from satellite cloud
observations might be important predictors of severe weather (Houghton
and Wilson, 1975). Time and space derivatives of equivalent black body
temperatures might reveal small-scale features which lead to subsequent
thunderstorms. A microwave sensor, such as that flown on the NASA
satellites, would provide indications of soil moisture. Albedo might
be important as well. Some preliminary experiments with regression
procedures and the ATS-3 satellite data by Sikula and Vonder Haar (1972)
indicated satisfactory results when the dependent variables were ceilings
and visibilities and independent variables were satellite radiances.

Even conventional data available from several mesoscale networks
such as HIPLEX (Scoggins and Wilson, 1976), NSSL (Fankhauser, 1969),
and METROMEX (Changnon et al., 1971) could be used in this type of
study to determine what additional information about subsequent
thunderstorms is available for a few areas. Several thunderstorm
seasons must be used, however.

b) Severe thunderstorms might be predicted from statistical
procedures by use of upper-air winds inferred from satellite thickness
(and geopotential height) calculations. Areas of jet streams and
difluence aloft could be identified and related to severe weather.

Digital radar data now available at several locations (Muench, 1976)
could be used as well as additive data from present MDR reports in
conjunction with severe weather prediction.

c) Different predictors from conventional data could be tested.
For example, present weather, past weather, visibility, wind gusts,
sky conditions and remarks are available from surface observations.
Climatological frequencies of occurrence for thunderstorms could be
computed from all available thunderstorm data and these used as
predictors as well. Use of upper-air data should be expanded to include
all the resolution in the present observation. In addition, time changes
for upper-air parameters might be tested. Trajectories of key parameters
might make important predictors. The K Index and TTI could both be
updated by using the temperature and moisture from 1800 GMT surface
observations averaged with those observed at 650 mb 12 h earlier.

d) The area for predictor selection should be allowed to vary
and the predictand area reduced. The reduction in correlation due to
reduced size of predictand might be compensated for by parameters from
smaller-scale data sources selected from different areas.

e) More work on the timeliness of upper-air data is required.
Additional days when 3-h data are available should be used to obtain a
more adequate sample. Similarly, further research into the time changes
of surface and upper-air reports should be performed to determine
atmospheric response times (in terms of producing intense convection)
for various physical processes such as differential advections.

f) Further work on air mass stratifications would be fruitful.
One might use combinations of temperature, wind and moisture to identify
three or four types of air masses. Five years of digital radar data
will be available for this type of work after the 1977 season.

g) We should continue to investigate random sampling or other
ways of reducing the many nonoccurrence days. A forecaster is not
concerned with predicting thunderstorms on the many days that he is
confident there will be none.

h) There should be more investigation into verification techniques
for this type of data.

APPENDIX A

ANOVA for selected regressions

(1) Total Equation (7 predictors)

Source            Degrees of   Sum of Squares   Mean Square    F Value   R²
                  freedom
Model                  7        239.94357522    34.27765360    294.91    0.22481796
Error               7118        827.33582417     0.11623150
Corrected Total     7125       1067.27939938

Parameter        Units                         β estimate     Standard error
Intercept        -                              3.16740081    -
MDIV             g g^-1 s^-1 x 10^8            -0.00276198    0.00020761
W                g g^-1                        25.43248796    1.64332634
MDRP             (1 or 0)                       0.39082300    0.01794391
CSIL             K                             -0.00512986    0.00069915
θe7              K                             -0.01096999    0.00097647
[illegible]7     g g^-1 x 10^[illegible]        0.05462544    0.00395796
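
The entries of an ANOVA table of this kind are tied together by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, the F value is the ratio of the model and error mean squares, and R² is the model sum of squares divided by the corrected total. The sketch below checks the values printed for the Total Equation with seven predictors; the function name `anova_summary` is illustrative and not part of the original analysis.

```python
def anova_summary(ss_model, ss_error, df_model, df_error):
    """Derive mean squares, F, and R^2 from sums of squares and degrees of freedom."""
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    return {
        "ms_model": ms_model,
        "ms_error": ms_error,
        "f_value": ms_model / ms_error,
        "r_squared": ss_model / (ss_model + ss_error),
    }

# Values as printed in table (1) above.
summary = anova_summary(ss_model=239.94357522, ss_error=827.33582417,
                        df_model=7, df_error=7118)
for name, value in summary.items():
    print(f"{name:10s} {value:.8f}")
```

The recomputed mean squares, F value, and R² agree with the printed table to the precision shown, which gives some confidence that the numeric columns survived transcription.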

(2) Northwind Equation (7 predictors)

Source            Degrees of   Sum of Squares   Mean Square    F Value   R²
                  freedom
Model                  7         78.07938810    11.15419830    127.84    0.274014
Error               2371        206.86722812     0.08724894
Corrected Total     2378        284.94661623

Parameter              Units                   β estimate     Standard error
Intercept              -                        1.19350345    0.30434664
MDIV                   g g^-1 s^-1 x 10^8      -0.00303301    0.00034990
(T-Td)[illegible]      K                       -0.00430722    0.00084002
W                      g g^-1                  69.09513072    6.42057583
MDRP                   0 or 1                   0.36068872    0.02911004
T                      K                       -0.01807394    0.00359192
T-Td                   K                        0.02365490    0.00360711
θe7                    K                       -0.00485763    0.00101146

(3) Random Equation (6 predictors)

Source            Degrees of   Sum of Squares   Mean Square    F Value   R²
                  freedom
Model                  6        161.91260308    26.98543385    155.31    0.291224
Error               2268        394.05926505     0.17374747
Corrected Total     2274        555.97186813

Parameter        Units                   β estimate     Standard error
Intercept        -                        1.57191166    0.46848081
MDIV             g g^-1 s^-1 x 10^8      -0.00340850    0.00039607
(T-Td)7          K                       -0.00789130    0.00129024
W                g g^-1                  46.44806192    3.21385797
TTI              K                        0.01232299    0.00144379
θe7              K                       -0.00650511    0.00150835
MDRP             0 or 1                   0.24007083    0.02659094

(4) Total Equation (20 predictors)

Source            Degrees of   Sum of Squares   Mean Square    F Value   R²
                  freedom
Model                 20        259.37113846    12.96855692    114.05    0.24302084
Error               7105        807.90826092     0.11370982
Corrected Total     7125       1067.27939938

Parameter              Units                         β estimate     Standard error
Intercept              -                              8.91002397    -
θe                     K                              0.00762649    0.00105337
MDIV                   g g^-1 s^-1 x 10^8            -0.00228346    0.00023036
θe A                   K s^-1 x 10^[illegible]        0.00006796    0.00002784
[illegible]            mb m^-2 x 10^12                0.00008997    0.00005411
c                      s^-1 x 10^[illegible]          0.00036562    0.00023066
[illegible]            g g^-1 m^-2 x 10^15            0.00009605    0.00002797
KI                     K                              0.01872137    0.00316708
DTH                    m mb^-1 x 10^[illegible]      -0.00127867    0.00055703
(T-Td)[illegible]      K                              0.01376441    0.00341045
(T-Td)[illegible]      K                              0.02245995    0.00355782
[illegible]            m                              0.00262488    0.00048169
W                      g g^-1                        10.97990075    3.09197109
MDRP                   (0 or 1)                       0.39041505    0.01793046
V                      m s^-1                        -0.00397307    0.00112285
CSIM                   K                             -0.15500734    0.02892617
STSI                   [illegible]                    0.01596813    0.00284876
[illegible]            K                             -0.03357575    0.00484179
θe7                    K                             -0.00633453    0.00142566
[illegible]            g g^-1 x 10^[illegible]        0.14166814    0.01732946
u                      m s^-1                        -0.00565405    0.00146472

(3) Southwind dependent equation applied to:

Cut off     Southwind independent data      Southwind dependent data

0.25                  Yes      No                     Yes      No
              Yes     238     176             Yes     407     181
              No      259    1211             No      521    2003

0.27                  Yes      No                     Yes      No
              Yes     223     191             Yes     372     216
              No      210    1260             No      425    2099

0.30                  Yes      No                     Yes      No
              Yes     185     229             Yes     330     258
              No      136    1334             No      311    2213

0.33                                                  Yes      No
                                              Yes     277     311
                                              No      226    2298
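
Tables of this form can be produced by comparing the value predicted by a regression equation with a cutoff and cross-tabulating the resulting yes/no forecast against the observed occurrence. The following sketch shows one way to do so; the function name, the choice of ">=" at the cutoff, the ordering of rows (observed) and columns (forecast), and the small data set are all assumptions for illustration rather than details taken from the study.

```python
import numpy as np

def contingency_table(predicted, observed, cutoff):
    """2 x 2 table with rows = observed (yes, no), columns = forecast (yes, no)."""
    forecast_yes = np.asarray(predicted, dtype=float) >= cutoff
    observed_yes = np.asarray(observed) == 1
    return np.array([
        [np.sum(observed_yes & forecast_yes), np.sum(observed_yes & ~forecast_yes)],
        [np.sum(~observed_yes & forecast_yes), np.sum(~observed_yes & ~forecast_yes)],
    ])

# Hypothetical regression outputs and observed occurrences at six grid points.
predicted = [0.10, 0.26, 0.40, 0.31, 0.05, 0.28]
observed = [0, 1, 1, 0, 0, 0]

table_025 = contingency_table(predicted, observed, cutoff=0.25)
table_030 = contingency_table(predicted, observed, cutoff=0.30)
print(table_025)
print(table_030)
```

Raising the cutoff moves cases from the forecast-yes column to the forecast-no column while leaving the row totals unchanged, which is the same pattern seen in the tables above as the cutoff increases from 0.25 to 0.33.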